Independence versus Indetermination: basis of two canonical clustering criteria
Pierre Bertrand, Michel Broniatowski, Jean-François Marcotorchino
Advances in Data Analysis and Classification manuscript No. (will be inserted by the editor)
Statistical Independence versus Logical Indetermination: two ways of generating clustering criteria through couplings. Application to graphs modularization
P. Bertrand · M. Broniatowski · J.-F. Marcotorchino
Received: date / Accepted: date
Abstract
This paper aims at comparing two coupling approaches as basic layers for building clustering criteria suited for modularizing very large graphs. Although the scientific literature is rich in clustering criteria dedicated to graph and network decomposition, we rework this subject by proposing a new symmetric and dual approach based on coupling functions, which allows comparing and calibrating them. To elaborate those coupling maps, we briefly use optimal transport theory as a starting point, then derive two main families of criteria: those based upon statistical independence and those based upon logical indetermination. Among other tools, we use the so-called Monge properties, applied to contingency matrices, as a specific device for putting forward some key features of those criteria. A further and deeper study is devoted to logical indetermination, because it is by far the lesser known of the two. Those dual and parallel criteria are perfectly suited for graph clustering, as illustrated in this paper on various types of graphs.
Keywords
Correlation Clustering · Mathematical Relational Analysis · Logical Indetermination · Coupling Functions · Optimal Transport · Graph Theoretical Approaches
P. Bertrand
Laboratoire de Probabilités, Statistique et Modélisation, CNRS UMR 8001, Sorbonne Université, Paris, France
E-mail: [email protected]

M. Broniatowski
Laboratoire de Probabilités, Statistique et Modélisation, CNRS UMR 8001, Sorbonne Université, Paris, France
E-mail: [email protected]

J.-F. Marcotorchino
Institut de statistique, Sorbonne Université, Paris, France
E-mail: [email protected]
1 Introduction

As mentioned in the abstract, this paper introduces two coupling approaches as basic layers for building clustering criteria suited for modularizing very large graphs. Graph clustering (or clique partitioning of graphs) is a key topic, covered by a very large dedicated literature. One of the reasons for this status is the recent and powerful use made by the GAFAM companies of the very large graphs resulting from modern activities: big social networks, cellphone communication networks, high-speed financial trading, large IT networks, IoT networks, etc.
This goes along with the IT capacity available today to store the huge amounts of data those activities force us to cope with. The sudden appearance of these big networks gave rise to a renewal of the so-called graph theoretical domain, used in that context for different purposes, such as discovering the latent cliques, clustering the whole graph, or isolating some key parts of interest within the network. In other words, the massive and raw information contained inside the networks must be analyzed per se, and this leads to mandatory techniques, among which graph clustering plays a prominent role, with a lot of practical contextual applications.

At that stage, two aspects must be differentiated: on the one hand (i) the existence of generic algorithms dealing with various clustering criteria as global objective functions, which can be changed according to the context we want to address; on the other hand (ii) the graph clustering criteria themselves, as soon as we must choose some of them as global objective functions during the network analysis step. Both points will be discussed throughout this article, although we shall insist essentially on the second point (ii).

Going back to the first point (i), concerned with generic algorithms, it is well known that several methods were introduced to fit this purpose, notably the famous
Louvain algorithm, whose origin is quite recent [4], and which is recognized as a very good tool by the scientific community. It is based upon the optimization, through some ad-hoc heuristics, of a global function called modularity (we shall discuss this notion later on). In a few words, the global optimization is obtained iteratively by optimizing a local cost function: two vertices are said to be similar if they are connected with a weight which sufficiently differs from the mean weight of their neighborhood. The cost function, as we will see later, is built on the departure from the usual independence coupling function. The method has been naturally generalized in [6], where the authors proposed to choose a candidate criterion among a list of global criteria different from the usual modularity. In her thesis [7], Patricia Conde-Céspedes proposed some experiments on usual graphs, involving our criteria plus some others, showing that results may vary from one criterion to another while remaining consistent and interpretable. The process she performed is exactly the fusion of (i) and (ii) in the same design: as already mentioned, the same algorithmic process (in that case the generic Louvain one) is applied with different clustering criteria. It is interesting to notice that the resulting numbers of classes (which are not fixed in advance by the method, contrary to the k-means approach) were coherent and comparable from one criterion to another for most of the studied graphs.

To fulfill objective (ii), we will focus in this paper on two graph clustering criteria. The first one, quite classic and largely used, is called modularity ($M^{\times}$), a kind of measure of the deviation from statistical independence; the second ($M^{+}$) is locally based on the deviation from another coupling function, already latent in a paper of Fréchet [9], that we shall call indetermination or logical indetermination (a notion introduced by J.-F. Marcotorchino in his seminal papers [14] and [17]). Here we propose a theoretical approach to understand the behavior of both criteria. The function $M^{+}$ has already been tested by Patricia Conde-Céspedes in her thesis [7] on particular graphs. We shall replicate here the experimental results she obtained, but we want both to reanalyze more systematically the behavior of those criteria on the very simple model of Erdős–Rényi random graphs and to bring much more solid bases to the theoretical interpretation of the chosen criteria $M^{\times}$ and $M^{+}$. To express similarities between them, we will conduct a deep analysis of the two underlying coupling functions: the well-known independence (usually denoted by $\otimes$) and the more recent indetermination (denoted by $\oplus$), which we shall introduce later on.

The paper is structured as follows. In section 2 we propose a parallel discovery of the two coupling functions ($\otimes$) and ($\oplus$) using discrete optimal transport theory. Section 3 lists a set of dual properties related to Monge matrices. Section 4 studies indetermination in depth, introducing properties that, to our knowledge, deserve to be put forward given the poor coverage devoted to them in the literature. Finally, section 5 gathers a study of the behavior of the criteria based on those coupling functions on the general Erdős–Rényi random graph model, quoting similarities and differences through specific graphs.
2 A parallel discovery of two coupling functions through optimal transport

When we want to couple two marginal laws, the most common and straightforward way to proceed consists in assuming independence and carrying on with the computations. It is so well integrated in our mindset that it appears naturally in real-life applications, as soon as we want to quickly build models. In scientific work, the approach is quite the same: when we use a very classical criterion like the $\chi^2$ index, we are measuring nothing but a deviation from independence.

Thinking about how we first encountered independence immediately suggests empirical experiments: say we roll a die twice; how should we derive the resulting probabilities from those of a single roll? Most of us will naturally apply the independence coupling: it relies on empirical experiments.
Although it is the most natural one, independence is by far not the only available coupling method; actually, as introduced by Sklar in [23], any copula function leads to a coupling function acting on two cumulative distribution functions. In this document, we link a coupling function to a given optimal transport problem. Hence, to follow a similar approach for the indetermination coupling, we first train ourselves by extracting the independence coupling from the optimization of a transport problem, and we then generalize the principle by applying the same approach to the indetermination case, with a second and different transport problem.

We have already used the term "coupling function" several times; let us now define it formally, since it will be a key notion throughout the document.
Definition 1 (Coupling function)
Given $\mu = (\mu_1, \dots, \mu_p)$ and $\nu = (\nu_1, \dots, \nu_q)$ two discrete probability distributions called marginal distributions (or simply margins), we want to define a probability $\pi = (\pi_{u,v})_{1 \le u \le p,\, 1 \le v \le q}$ on the product space. A way of building it up consists in exhibiting a coupling function $C$ such that $\pi = C(\mu, \nu)$, satisfying the following constraints:
– (first margin) $C(\mu,\nu)_{u,\cdot} = \sum_{v=1}^{q} C(\mu,\nu)_{u,v} = \mu_u, \ \forall\, 1 \le u \le p$
– (second margin) $C(\mu,\nu)_{\cdot,v} = \sum_{u=1}^{p} C(\mu,\nu)_{u,v} = \nu_v, \ \forall\, 1 \le v \le q$
– (positivity) $C(\mu,\nu)_{u,v} \ge 0, \ \forall\, 1 \le u \le p, \ \forall\, 1 \le v \le q$

Remark 1
All coupling functions (or maps) we use will satisfy $\pi_{u,v} = C(\mu,\nu)_{u,v} = C(\mu_u, \nu_v)$; this expresses that the value of $\pi$ at $(u,v)$ only depends upon the values of the corresponding margins $\mu_u$ and $\nu_v$.

The constraints $C$ has to respect lead us to cope with some difficulties. This is the reason why we choose a systematic approach: minimizing a cost function and observing the link with the optimal transport definition. The ad-hoc discrete optimal transport problem we will be dealing with typically looks like Problem 2, given hereafter (where MKP stands for Monge–Kantorovich Problem).

Before introducing Problem 2 in detail, let us go back to the historical problem (quoted here as Problem 1). It is the merit of the French mathematician Gaspard Monge to have been the first to address, in 1781, the problem known as the problem of "remblais et déblais". It can be simply stated as follows: what is the most efficient way (in terms of work, or minimization of effort) to move a pile of sand to fill up an excavation of the same volume? The constraint of volume incompressibility makes the problem difficult.
Problem 1 (Original Monge Problem)
$$\min_{T} \int_{X} C(x, T(x))\, dx$$

Using modern notations, a "sand pile" is represented by a probability measure $\mu \in P(X)$ and a "hole to fill up" by a probability measure $\nu \in P(Y)$. Those probability measures correspond to the margins of Definition 1. They still live in a continuous space, as we follow the historical introduction, but we will come back later on to a discrete space. A priori, holes have the same volume as sand piles do, which implies $\mu(X) = \nu(Y) = 1$. Let us also give a continuous transportation cost function $C : X \times Y \to \mathbb{R}^{+}$.

A solution to Problem 1 (if any) is called an optimal transport map or a Monge solution. Let us remark that transport maps from $\mu$ to $\nu$ may not exist; for instance, this is the case if $\mu$ is a Dirac $\delta_a$ at point $a$ whereas $\nu$ is not. From a more general standpoint, one should also remark that Monge's formulation is quite rigid, in the sense that it requires the whole mass of $x$ in $X$ to be assigned to the same target $T(x)$ (no split is permitted). Faced with the difficulties of Monge's problem, as commonly met in hard problem solving, the solution resides in the extension or relaxation of the search domain itself. It is exactly what happened to Monge's problem: in 1942, Leonid Kantorovich (Nobel Prize in Economics, 1975) proposed a relaxed formulation that allows mass splitting; a discrete version is presented below as Problem 2.

Problem 2 (Discrete Version of MKP)
$$\min_{\pi} \sum_{u=1}^{p} \sum_{v=1}^{q} C(\pi(u,v))\, \pi(u,v)$$
subject to:
$$\sum_{v=1}^{q} \pi(u,v) = \mu_u, \ \forall u \in \{1, \dots, p\}$$
$$\sum_{u=1}^{p} \pi(u,v) = \nu_v, \ \forall v \in \{1, \dots, q\}$$
$$\pi(u,v) \ge 0, \ \forall (u,v) \in \{1, \dots, p\} \times \{1, \dots, q\}$$

The choice of a cost function $C$ depends upon the application we want to address. For instance, we can force the result $\pi$ to concentrate as little information as possible; this means we force it to be as close as possible to the uniform law on the product space (remember it has to satisfy the given margins). Another choice: we can, as well, maximize the entropy of $\pi$. Both cases are usual approaches, introduced in several articles. They expect the global assignment to be as smooth as possible.

An MKP problem is essentially given by its cost function, while the margins $(\mu, \nu)$ may vary. This is the reason why we shall solve it with a model taking the fixed margins as parameters. Let us now define the optimal coupling function $C$ associated with a given MKP problem with fixed margins given as parameters.

Definition 2 (Coupling function associated with an MKP problem)
For a given MKP problem $P$, we can define a coupling function $C_P$ by $C_P(\mu, \nu) = \pi^{*}(P)$, provided that $\pi^{*}$ exists as the unique solution of $P$ with margins $\mu$ and $\nu$.

Following Definition 2, we now derive the solutions of two discrete optimal transport problems that we shall use in section 5: each yields a structured and well-defined criterion, suitable for graph clustering.

2.2 Alan Wilson's Entropy Model: role of "independence"

First introduced by Sir Alan Wilson in 1969 for "Spatial Interaction Modeling", the "Flows Entropy Model" can be found in his various publications: originated in [26], developed in [27], and refined in his book [28]. A fundamental justification of his approach corresponds to the following contextual situation: in a theoretical system whose elements do not maintain affinities, it is advisable to determine the distribution of the normalized frequency flows $\pi(u,v)$, supposing $\pi(u,v) \ge 0$, by maximizing Boltzmann's or Shannon's entropy, so that the problem should be expressed as follows:
Problem 3 (Unbalanced PSIS)
$$\min_{\pi} \sum_{u=1}^{p} \sum_{v=1}^{q} \pi(u,v) \ln(\pi(u,v))$$

(minimizing $\sum \pi \ln \pi$ amounts to maximizing the entropy $-\sum \pi \ln \pi$). In a situation of total absence of information, the minimization in Problem 3 just amounts to satisfying the constraint that the cell-value distribution is effectively a probability (i.e., the sum of the positive $\pi(u,v)$ equals 1, as for any joint probability distribution). The solution of this very simple "Program of Spatial Interaction System" (PSIS) can be expressed as follows:
$$\pi^{*}(u,v) = \frac{1}{pq} \quad (1)$$

In other words, when we ignore everything about the way the exchanges are built up, it is necessary to use Laplace's principle of "insufficient reason" and to consider that the world trade is uniformly distributed inside the system. By using margins, say information about total exports (origin flows) and total imports (destination flows), the degree of disorder of the system can be drastically reduced. Indeed, totals on rows and columns are no longer free, but must match the marginal values $\mu_u$ and $\nu_v$ fixed by the application, as expressed in Problem 4, the solution of which is given by Theorem 1.

Problem 4 (Balanced PSIS)
$$\min_{\pi} \sum_{u=1}^{p} \sum_{v=1}^{q} \pi(u,v) \ln(\pi(u,v))$$
subject to:
$$\sum_{v=1}^{q} \pi(u,v) = \mu_u, \ \forall\, 1 \le u \le p$$
$$\sum_{u=1}^{p} \pi(u,v) = \nu_v, \ \forall\, 1 \le v \le q$$
$$0 \le \pi_{u,v} \le 1, \ \forall\, 1 \le u \le p, \ 1 \le v \le q$$
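As a quick numerical sanity check (ours, not part of the original derivation), one can solve Problem 4 with a generic constrained optimizer; Theorem 1 below states that the optimum is the product of the margins, which the sketch confirms on arbitrary example margins:

```python
# Sketch: numerically solve Problem 4 (balanced PSIS) and compare the
# optimizer's solution with the product coupling mu_u * nu_v.
import numpy as np
from scipy.optimize import minimize

mu = np.array([0.5, 0.3, 0.2])        # first margin  (p = 3), arbitrary example
nu = np.array([0.4, 0.35, 0.25])      # second margin (q = 3), arbitrary example
p, q = len(mu), len(nu)

def objective(flat_pi):
    """Sum of pi * ln(pi): minimizing it maximizes the entropy."""
    pi = flat_pi.reshape(p, q)
    return np.sum(pi * np.log(np.clip(pi, 1e-12, None)))

constraints = (
    [{"type": "eq", "fun": lambda f, u=u: f.reshape(p, q)[u, :].sum() - mu[u]} for u in range(p)]
    + [{"type": "eq", "fun": lambda f, v=v: f.reshape(p, q)[:, v].sum() - nu[v]} for v in range(q)]
)
x0 = np.full(p * q, 1.0 / (p * q))    # start from the uniform law

res = minimize(objective, x0, method="SLSQP",
               bounds=[(0.0, 1.0)] * (p * q), constraints=constraints)
pi_star = res.x.reshape(p, q)

# The optimum should agree closely with the independence coupling:
print(np.abs(pi_star - np.outer(mu, nu)).max())
```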
Theorem 1
The solution of Problem 4 is $\pi^{\times}(u,v) = \mu_u \nu_v$. Hence the coupling function associated with Problem 4 is nothing but "independence":
$$C_{Problem\,4}(\mu,\nu)_{u,v} = C^{\times}(\mu,\nu)_{u,v} = (\mu \otimes \nu)_{u,v} = \mu_u \nu_v$$

We skip the proof of Theorem 1, as it is similar to the one we develop for Theorem 2, which is less common. As a conclusion, from the direct maximization of entropy we get the solution expressed in terms of probability, and we remark that the associated coupling function is nothing but "independence" (denoted by $\otimes$ throughout the document). We also note that the degree of disorder is not total: flows possess an intensity which is proportional to the weights of the partners in the world trade exchange matrix, in the case of an economic application.

2.3 The Minimal Trade Model: role of "indetermination"

In the "Minimal Trade Model" (see [25], [14] and [17]), we still impose the balanced marginal distributions and mass-preserving constraints, but we change the structure of the objective function to get a smoother breakdown of the origin-destination values $n_{u,v}$ than in Alan Wilson's entropy model (this explains the term "Minimal Trade"). We still suppose $\pi(u,v) = \frac{n_{u,v}}{n_{\cdot,\cdot}}$ for any real application. In that case the criterion is a quadratic function measuring the squared deviation of the cell values from the "no information" situation (the uniform joint distribution related to Problem 3). As expected, in case of free margins, the solution remains the uniform law. Then, adding the usual constraints on margins, the least squares problem is Problem 5, the solution of which is given by Theorem 2.

Problem 5 (Minimal Trade Model)
$$\min_{\pi} \sum_{u,v} \left( \pi(u,v) - \frac{1}{pq} \right)^2$$
subject to:
$$\sum_{v=1}^{q} \pi(u,v) = \mu_u, \ \forall\, 1 \le u \le p$$
$$\sum_{u=1}^{p} \pi(u,v) = \nu_v, \ \forall\, 1 \le v \le q$$
$$0 \le \pi_{u,v} \le 1, \ \forall\, 1 \le u \le p, \ 1 \le v \le q$$

Theorem 2
The solution of Problem 5 is $\pi^{+}(u,v) = \frac{\mu_u}{q} + \frac{\nu_v}{p} - \frac{1}{pq}$. Hence the coupling function associated with Problem 5 is nothing but "indetermination":
$$C_{Problem\,5}(\mu,\nu)_{u,v} = C^{+}(\mu,\nu)_{u,v} = (\mu \oplus \nu)_{u,v} = \frac{\mu_u}{q} + \frac{\nu_v}{p} - \frac{1}{pq}$$

A supplementary condition, exogenous with regard to the previous model, can be added on the margins (which are, by the way, constant values given a priori); this condition (see [14]) is a simple inequality which guarantees the positivity of the frequency matrix $\pi^{*}(u,v)$ we are looking for, whatever $\mu_u$ and $\nu_v$ are:
$$p \min_u \mu_u + q \min_v \nu_v \ge 1 \quad (2)$$

Furthermore, since the matrix $\pi(u,v)$ represents "frequencies", the last constraint of Problem 5 plays the role of a supplementary endogenous constraint, ensuring $\pi(u,v) \le 1$.
Notice that in the "Adjustment to Fixed Margins for Contingency Tables" case, the associated values $n_{u,v}$ must be integers, which makes the problem much more complex to solve; relaxing this integrality constraint leads formally to Problem 5.

Remark 2 (Vanishing bias)
By developing the cost function, we obtain an interesting equality we will reuse later on:
$$\sum_{u,v} \left( \pi(u,v) - \frac{1}{pq} \right)^2 = \sum_{u,v} \pi(u,v)^2 - \frac{1}{pq} \quad (3)$$
so that the influence of the constant shift $\frac{1}{pq}$ in the squared model disappears.

Proof
The proof we propose comes directly from [25] and [17]. A generalization of the canonic additive form, when condition (2) is relaxed, can be found in the forthcoming thesis [3]. Using equality (3), the Lagrangian function associated with the previous minimization model can be written:
$$\mathcal{L}(\pi, \lambda, \omega, \theta) = \sum_{u=1}^{p} \sum_{v=1}^{q} \pi(u,v)^2 - \sum_{u=1}^{p} \lambda_u \left( \sum_{v=1}^{q} \pi(u,v) - \mu_u \right) - \sum_{v=1}^{q} \omega_v \left( \sum_{u=1}^{p} \pi(u,v) - \nu_v \right) - \theta \left( \sum_{u=1}^{p} \sum_{v=1}^{q} \pi(u,v) - 1 \right)$$

Since the function to optimize is convex, the solution we are looking for is a minimum, so the first-order conditions apply and we get the following system of equations:
$$\frac{\partial \mathcal{L}(\pi, \lambda, \omega, \theta)}{\partial \pi(u,v)} = 2\pi(u,v) - \lambda_u - \omega_v - \theta = 0 \quad (4)$$
$$\frac{\partial \mathcal{L}(\pi, \lambda, \omega, \theta)}{\partial \lambda_u} = \mu_u - \sum_{v=1}^{q} \pi(u,v) = 0 \quad (5)$$
$$\frac{\partial \mathcal{L}(\pi, \lambda, \omega, \theta)}{\partial \omega_v} = \nu_v - \sum_{u=1}^{p} \pi(u,v) = 0 \quad (6)$$

Supposing $\sum_v \omega_v = 0$ (Lagrange multipliers are defined up to an additive constant), we sum (4) over $v$ to obtain $2\mu_u = q\lambda_u + q\theta$, so that
$$\lambda_u + \theta = \frac{2\mu_u}{q}, \ \forall u \quad (7)$$

Summing (4) over $u$ and using (6), we get $2\nu_v = \sum_{u=1}^{p} (\lambda_u + \theta) + p\,\omega_v = \sum_{u=1}^{p} \frac{2\mu_u}{q} + p\,\omega_v = \frac{2}{q} + p\,\omega_v$, so that
$$\omega_v = \frac{2\nu_v}{p} - \frac{2}{pq}, \ \forall v \quad (8)$$

Replacing $\lambda_u + \theta$ and $\omega_v$ in (4) by their values given respectively by (7) and (8), we obtain:
$$\pi^{*}(u,v) = \frac{\mu_u}{q} + \frac{\nu_v}{p} - \frac{1}{pq}, \ \forall (u,v)$$

Remark that, since condition (2) applies, the $\pi^{*}$ expressed in the previous equation are nonnegative. We will go back to this expression in the next sections and develop a deeper focus on it, explaining the true meaning of the term "indetermination" and some other consequences.
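Both closed forms are easy to implement. A minimal sketch (with arbitrary example margins) builds the couplings of Theorems 1 and 2 and checks the margin constraints as well as the positivity condition (2):

```python
# Sketch: the two canonical couplings in closed form, on example margins.
import numpy as np

def independence(mu, nu):
    """Coupling of Theorem 1: pi(u,v) = mu_u * nu_v."""
    return np.outer(mu, nu)

def indetermination(mu, nu):
    """Coupling of Theorem 2: pi(u,v) = mu_u/q + nu_v/p - 1/(p*q)."""
    p, q = len(mu), len(nu)
    return mu[:, None] / q + nu[None, :] / p - 1.0 / (p * q)

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.3, 0.3, 0.2, 0.2])
p, q = len(mu), len(nu)

for pi in (independence(mu, nu), indetermination(mu, nu)):
    assert np.allclose(pi.sum(axis=1), mu)   # first margin constraint
    assert np.allclose(pi.sum(axis=0), nu)   # second margin constraint

# Condition (2): p*min(mu) + q*min(nu) >= 1 guarantees positivity of pi^+.
assert p * mu.min() + q * nu.min() >= 1      # here: 3*0.2 + 4*0.2 = 1.4
assert (indetermination(mu, nu) >= 0).all()
```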
We notice that $\pi_{u,v} = \frac{\mu_u}{q} + \frac{\nu_v}{p} - \frac{1}{pq}$ can be written
$$\pi_{u,v} - \frac{1}{pq} = \frac{1}{q}\left( \mu_u - \frac{1}{p} \right) + \frac{1}{p}\left( \nu_v - \frac{1}{q} \right)$$
so that indetermination basically sums up the distances to uniformity of each margin.

2.4 A distance between the two couplings

To measure how far the two couplings are from each other, we compute the expected squared difference between them when the margins $\mu$ and $\nu$ follow the Dirichlet law (basically, the uniform law on probability distributions). We recall here the form of that law for our application.

Definition 3 (Dirichlet's Law)
The density of the Dirichlet law $\mathcal{D}_p$, representing a uniform law among probability laws on $p$ elements, is expressed as follows:
$$f(\mu_1, \dots, \mu_p) \prod_{k=1}^{p} d\mu_k = \frac{1}{B(p)} \prod_{k=1}^{p} \mu_k^{0} \prod_{k=1}^{p} d\mu_k = \frac{1}{B(p)} \prod_{k=1}^{p} d\mu_k$$
where $B$ is the multinomial Beta function.

Having expressed a density function for $\mu$ and for $\nu$ (replace $p$ by $q$), we apply to them the two coupling functions $C^{+}$ and $C^{\times}$. As a distance, we define:
$$\Delta_{p,q} = \mathbb{E}_{(\mu,\nu) \sim \mathcal{D}_p \otimes \mathcal{D}_q} \left[ \sum_{u=1}^{p} \sum_{v=1}^{q} \left[ (\mu \otimes \nu)_{u,v} - (\mu \oplus \nu)_{u,v} \right]^2 \right]$$
and compute its value, using $(\mu \otimes \nu)_{u,v} - (\mu \oplus \nu)_{u,v} = \left( \mu_u - \frac{1}{p} \right)\left( \nu_v - \frac{1}{q} \right)$, through the sequence:
$$\Delta_{p,q} = \mathbb{E}_{(\mu,\nu) \sim \mathcal{D}_p \otimes \mathcal{D}_q} \left[ \sum_{u=1}^{p} \sum_{v=1}^{q} \left( \mu_u - \frac{1}{p} \right)^2 \left( \nu_v - \frac{1}{q} \right)^2 \right] = \mathbb{E}_{\mu \sim \mathcal{D}_p} \left[ \sum_{u=1}^{p} \left( \mu_u - \frac{1}{p} \right)^2 \right] \mathbb{E}_{\nu \sim \mathcal{D}_q} \left[ \sum_{v=1}^{q} \left( \nu_v - \frac{1}{q} \right)^2 \right] = pq\; \mathbb{E}_{\mu \sim \mathcal{D}_p} \left[ \left( \mu_1 - \frac{1}{p} \right)^2 \right] \mathbb{E}_{\nu \sim \mathcal{D}_q} \left[ \left( \nu_1 - \frac{1}{q} \right)^2 \right]$$

Now we need the variance of $\mathcal{D}_p$; as it is a known law, we use the following property:

Proposition 1 (Variance of the Dirichlet law)
$$\mathbb{V}_{X \sim \mathcal{D}_p}[X_1] = \frac{p-1}{p^2(p+1)}$$

Proposition 1 implies, in particular, that the margins concentrate their values around $\frac{1}{p}$ and $\frac{1}{q}$ respectively as soon as $p$ or $q$ increases. As we notice that the couplings equal each other when one of the margins is uniform, this should imply that $\Delta_{p,q}$ converges to 0 if either of the two increases. This is exactly what happens; we obtain the expression:
$$\Delta_{p,q} = \frac{1}{pq} \left( \frac{p-1}{p+1} \cdot \frac{q-1}{q+1} \right) \le \frac{1}{pq}$$

This last inequality confirms what was expected: as the margins are concentrated around their means, the two couplings tend to be equal rapidly as $p$ or $q$ increases.
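The closed form for $\Delta_{p,q}$ is easy to validate by simulation; the following sketch (sizes and sample count are arbitrary choices of ours) estimates the expectation by Monte Carlo over Dirichlet-distributed margins:

```python
# Sketch: Monte Carlo estimate of Delta_{p,q} versus its closed form.
import numpy as np

rng = np.random.default_rng(0)
p, q, n_samples = 4, 6, 50_000

acc = 0.0
for _ in range(n_samples):
    mu = rng.dirichlet(np.ones(p))          # uniform draw on the p-simplex
    nu = rng.dirichlet(np.ones(q))          # uniform draw on the q-simplex
    diff = np.outer(mu, nu) - (mu[:, None] / q + nu[None, :] / p - 1 / (p * q))
    acc += (diff ** 2).sum()

estimate = acc / n_samples
closed_form = (1 / (p * q)) * ((p - 1) / (p + 1)) * ((q - 1) / (q + 1))
print(estimate, closed_form)                # the two values should agree closely
```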
2.5 Structural justification based upon an axiomatic result of Imre Csiszár

Although it may seem arbitrary, our restriction to these two coupling functions is all but a fortuitous decision: in [8], Csiszár actually shows that, provided we require a few additional intuitive properties, we must restrict ourselves to either least squares or maximum entropy as canonic "distances" between probability distributions.

Let us rewrite our transport problems in the notations he uses in [8]. We notice that Problems 4 and 5 aim at reducing a distance from $\pi$ to the uniform law (that term actually vanishes in both), where $\pi$ must satisfy constraints on its margins, leading to an eligible space $L_{\mu,\nu}$ inside the simplex $S_n$. In the first problem the distance function is the entropy, while in the second it is the $L_2$ norm.

A general question is how to adapt a "prior guess" $u$ to verify a list of constraints. Let us say $u$ lives in $S_n$ while the given constraints define a subspace $L \in \mathcal{L}$ ($\mathcal{L}$ is the set of subspaces of $S_n$ defined by a finite list of affine constraints; see [8] for more details). To formalize it, Csiszár defines a projection rule $\Pi$ as a function whose input is a set $L \in \mathcal{L}$ and which generates a method $\Pi_L$ to project any prior guess $u$ onto a vector in $L$:
$$\Pi : \mathcal{L} \to (S_n \to S_n), \quad L \mapsto \Pi_L : (u \mapsto \Pi_L(u) \in L)$$

The article then introduces a collection of "natural" properties that we gather hereafter.
– consistency: if $L' \subset L$ and $\Pi_L(S_n) \subset L'$ then $\Pi_{L'} = \Pi_L$; basically, if the result of a projection onto a bigger space always lies inside a smaller one, then the projections onto the two spaces are equivalent.
– distinctness: if $L$ and $L'$ are defined by a unique constraint and are not equal, then $\Pi_L \ne \Pi_{L'}$ (unless they both contain the initial prior guess). Typically, in $\mathbb{R}^2$, minimizing $\| \cdot \|_2$ on two lines returns different results as soon as they do not both contain 0.
– continuity: $\Pi$ is continuous with regard to $L \in \mathcal{L}$; it has a continuous relation with the constraints.
– scale invariance: $\Pi_{\lambda L}(\lambda u) = \lambda\, \Pi_L(u)$ for any positive $\lambda$ and any $u \in S_n$.
– locality: for any subset $J \subset \{1, \dots, n\}$, $(\Pi_L)_J = (\Pi_{L'})_J$ as soon as $L_J = L'_J$, where $L_J$ means we only keep the constraints dealing with coordinates in $J$, and $(\Pi_L)_J$ is the restriction of the resulting vector of $\Pi_L$ to the coordinates in $J$. This property indicates that the result of $\Pi$ on a set of coordinates only depends on the constraints applied to those coordinates.
– transitivity: for any $L' \subset L$, $\Pi_{L'} = \Pi_{L'} \circ \Pi_L$. We can first project onto a bigger space without affecting the result.

All those properties appear as must-haves for a convenient projection rule. The main result of [8] is that if $\Pi$ satisfies their combination, then it is limited to two forms:
– $\Pi_L : u \mapsto \mathrm{argmin}_{v \in L} \left[ \sum_{i=1}^{n} \alpha_i (v_i - u_i)^2 \right]$ for a fixed vector $\alpha$
– $\Pi_L : u \mapsto \mathrm{argmin}_{v \in L} \left[ \sum_{i=1}^{n} \alpha_i h_\beta(v_i \mid u_i) \right]$ for a fixed vector $\alpha$, with $h_\beta$ being specific functions defined in the paper, which are equal to the entropy in the case $\beta = 1$

We thus basically know that any convenient projection is coined out of $L_2$ projections or entropy-like $h_\beta$ functions. Adding a last property, similar to the Full Monge or Full Log-Monge conditions introduced in section 3, restricts to $\alpha = \beta = 1$, hence to the two problems treated in this document. This last property guarantees that the "no interaction" solution obtained when constraints are omitted (as in Problem 3) respects a proportional behavior: namely, if we update the total mass available (for instance in a monetary application), the resulting effect is proportional on each component.

Coming back to our transport problems, the "prior guess" is the uniform law, while the subspace $L \subset S_n$ is defined by the margin constraints forced by $\mu$ and $\nu$. Then, provided the quoted properties are required, the two cost functions we used cover an exhaustive view.

2.6 Conclusions drawn from the optimal transport overview

Using the generic formalism of optimal transport, we found two dual coupling functions. The first one, "independence", is well known, while the second introduces the so-called "indetermination", which follows a dual sequence of properties induced by the use of sums rather than products; we shall give further details on that point. In section 4 we present some highlights on the specific properties of "indetermination" and study it per se. For now, let us continue the parallel between those twin coupling functions in section 3 by introducing some properties of their corresponding contingency (or probability) matrices, leading to the $\oplus$ notation.

3 Monge matrices and the $\oplus$/$\otimes$ notation

We introduce two classes of matrices. The first one is attributed to Gaspard Monge, from a basic idea appearing in his 1781 paper (incidentally, see [5], where a reference is given to Alan Hoffman, who first coined that point and consequently proposed the name "Monge matrices"). For each of those Monge matrices, we point out some remarkable equalities and, moreover, we link them to a corresponding coupling function.
Doing so, we derive new properties for each of the two coupling functions introduced in section 2.

3.1 Monge property – "Indetermination"

To introduce Monge properties, we follow the exhaustive work of Rainer Burkard, Bettina Klinz and Rüdiger Rudolf exposed in the 66-page-long article [5], and begin with Definition 4.

Definition 4 (Monge and Anti-Monge matrix)
A $p \times q$ real matrix $c_{u,v}$ is said to be a Monge matrix if it satisfies:
$$c_{u,v} + c_{u',v'} \le c_{u',v} + c_{u,v'} \quad \forall\, 1 \le u \le u' \le p, \ 1 \le v \le v' \le q$$
and an Anti-Monge matrix if:
$$c_{u,v} + c_{u',v'} \ge c_{u',v} + c_{u,v'} \quad \forall\, 1 \le u \le u' \le p, \ 1 \le v \le v' \le q$$

Remark 4 (Full-Monge matrix)
The important case for our purpose is the equality case, when a matrix is both Monge and Anti-Monge; we will call this situation a "Full-Monge" matrix:
$$c_{u,v} + c_{u',v'} = c_{u',v} + c_{u,v'} \quad \forall\, 1 \le u \le u' \le p, \ 1 \le v \le v' \le q$$

Although it is poorly studied, this equality fits our purpose perfectly. The inequalities, on the contrary, are common and can be met in diverse situations such as cumulative distribution functions or copula theory.
Remark 5 (Adjacent cells)
A straightforward but important derived property is the local adjacent-cells equality: it is sufficient to satisfy the property of Remark 4 on adjacent cells to ensure a "Full-Monge" behavior for the global set of cells, i.e.:
$$c_{u,v} + c_{u+1,v+1} = c_{u+1,v} + c_{u,v+1} \quad \forall\, 1 \le u \le p-1, \ 1 \le v \le q-1$$

In 1961, Alan Hoffman (IBM Fellow and US Science Academy member) rediscovered Monge's observation, see [13]. Hoffman showed that the Hitchcock–Kantorovich transportation problem can be solved by a very simple approach if its underlying cost matrix satisfies those Monge properties.
Fig. 1 Example of an indetermination coupling (statistical counting vs probability forms)
Remark 5 is a key property for studying Monge matrices, since it gives a direct $O(pq)$ algorithm to verify whether a matrix is Monge.
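A minimal sketch of that $O(pq)$ check (function names and tolerance are ours): it tests the adjacent-cells equality of Remark 5, confirms that an indetermination coupling passes it, and, anticipating section 3.2, that the element-wise logarithm of an independence coupling passes it too.

```python
# Sketch: O(pq) Full-Monge test via the adjacent-cells equality of Remark 5.
import numpy as np

def is_full_monge(c, tol=1e-12):
    """True iff c[u,v] + c[u+1,v+1] == c[u+1,v] + c[u,v+1] for all adjacent cells."""
    lhs = c[:-1, :-1] + c[1:, 1:]
    rhs = c[1:, :-1] + c[:-1, 1:]
    return np.allclose(lhs, rhs, atol=tol)

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.4, 0.3, 0.2, 0.1])
p, q = len(mu), len(nu)

pi_plus = mu[:, None] / q + nu[None, :] / p - 1 / (p * q)   # indetermination
pi_times = np.outer(mu, nu)                                  # independence

print(is_full_monge(pi_plus))             # True (Theorem 3)
print(is_full_monge(pi_times))            # False in general
print(is_full_monge(np.log(pi_times)))    # True: pi_times is Full-Log-Monge
```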
Besides, a question emerges: which densities verify the Full-Monge property? The following Proposition 2 gives an interesting answer: all Full-Monge matrices derive from the density of an "indetermination" structure.

Proposition 2 (Full-Monge matrix is equivalent to "indetermination")
A "Full-Monge matrix" necessarily represents an "indetermination coupling".

Proof
Summing the equality of Remark 4 over $u'$ and $v'$, we straightforwardly obtain:
$$\sum_{u'=1}^{p} \sum_{v'=1}^{q} \left( c_{u,v} + c_{u',v'} - c_{u',v} - c_{u,v'} \right) = pq\, c_{u,v} + c_{\cdot,\cdot} - q\, c_{\cdot,v} - p\, c_{u,\cdot} = 0$$
$$\Rightarrow \quad c_{u,v} = \frac{c_{u,\cdot}}{q} + \frac{c_{\cdot,v}}{p} - \frac{c_{\cdot,\cdot}}{pq}$$

Summarizing the properties of Full-Monge matrices, we get the following Theorem 3.
Theorem 3 (Full-Monge matrices)
Let $\pi_{u,v}$ be the cell values of a probability matrix; then the following properties are equivalent:
1. $\pi$ is a Full-Monge matrix
2. $\pi_{u,v} = \pi^{+}_{u,v} = \frac{\mu_u}{q} + \frac{\nu_v}{p} - \frac{1}{pq}$
3. $\pi$ optimizes Problem 5 for some given margins
4. All $2 \times 2$ sub-tables $\{u, v, u', v'\}$ extracted from $\pi$ have the same sum on their diagonal and anti-diagonal

The last property of Theorem 3 is illustrated in Figure 1 and justifies the $\oplus$ notation assigned to "indetermination". Indeed, following the blue and the red arrows of Figure 1 (the diagonal and the anti-diagonal of a $2 \times 2$ sub-table), the two sums coincide, whether in the contingency or in the probability form.
Fig. 2 Example of an independence coupling (contingency vs probability forms)
Definition 5 (Full-Log-Monge Matrices)
A strictly positive $p \times q$ matrix $c_{u,v}$ is "Full-Log-Monge" when:
$$\ln(c_{u,v}) + \ln(c_{u',v'}) = \ln(c_{u',v}) + \ln(c_{u,v'}) \quad \forall\, 1 \le u \le u' \le p, \ 1 \le v \le v' \le q$$

To immediately get the correspondence, we propose in Remark 6 a transposition from one property to the other using the logarithm. It supposes matrices to be strictly positive (for our probability application, the whole discrete space must be reachable).
Remark 6 (From Log-Monge to Monge)
We easily verify that $c$ satisfies the condition proposed in Definition 5 if and only if $\ln(c)$, taken element-wise, is a Full-Monge matrix in the sense of Remark 4. Using Remark 6, we can check that the Full-Log-Monge property leads to interesting results and is linked to the "independence coupling"; without detailing their derivation, we gather those results within Theorem 4, the dual of Theorem 3.
Theorem 4 (Full-Log-Monge Matrices)
Let $\pi_{u,v}$ be a strictly positive probability matrix; then the following statements are equivalent:
1. $\pi$ is Full-Log-Monge
2. $\pi_{u,v} = \pi^{\times}_{u,v} = \mu_u \nu_v$
3. $\pi$ optimizes Problem 4 for some given margins
4. All $2 \times 2$ sub-tables $\{u, v, u', v'\}$ extracted from $\pi$ have the same product on their diagonal and anti-diagonal.

Figure 2 illustrates "Full-Log-Monge" matrices and their properties related to "independence"; it justifies the usual $\otimes$ notation. In these matrices the cell values are fractions; we want them to fulfil the same marginal values as those given in Figure 1. It is important to remark that both matrices (in Figure 1 and Figure 2) optimize a problem where the unique difference is the cost function (the margins are strictly identical). The last property of Theorem 4 is immediately verified on Figure 2: following the blue and the red arrows, the products on the diagonal and the anti-diagonal of any $2 \times 2$ sub-table coincide.
Fig. 3 View of the symmetry independence / indetermination: under the same Fréchet margin constraints ($\sum_{v=1}^{q} \pi(u,v) = \mu_u$, $\sum_{u=1}^{p} \pi(u,v) = \nu_v$, $0 \le \pi_{u,v} \le 1$), minimizing Alan Wilson's entropy cost yields independence ($\pi^{\times}_{u,v} = \mu_u \nu_v$, Full-Log-Monge matrices), while minimizing the squared deviation from the Laplace insufficient-reason solution (Minimal Trade Model) yields indetermination ($\pi^{+}_{u,v} = \frac{\mu_u}{q} + \frac{\nu_v}{p} - \frac{1}{pq}$, Full-Monge matrices)
We propose here some concluding remarks about the parallel definitions and properties our coupling functions, "independence" and "indetermination", fulfil; this is illustrated by Figure 3.
Both appear as the result of a discrete optimization problem with fixed marginal constraints; only the choice of their cost function allows the user to discriminate between the two possible approaches.
A priori, one cannot really justify choosing one cost function over the other. In practice, however, there is no doubt that most statisticians will choose the "independence coupling" as the more classical and more comfortable solution; but it is interesting, at least from a fair intellectual standpoint, to investigate the interest of the other solution.

Along the same lines, by introducing the two "Full-Monge matrix" forms, we have shown that a property suitable for one situation generates, by transposition, a similar property for the other one: once again, this does not induce any a priori justification for the preponderance of "independence".

The choice of "independence" comes from its easy interpretative power, as mentioned beforehand. Realizing an experiment leading to "independence" is natural: we can explain and understand it. On the contrary, few articles propose to realize a coupling according to "indetermination" (whose formula is given by Theorem 2). In section 4, we shall essentially work on describing this lesser-known coupling correctly, hoping this will help the reader to better understand its latent structure, before applying it within the graph clustering context.
4 A deeper study of the indetermination coupling

In this section, our goal is to better understand the "indetermination coupling", which until now we have essentially introduced from a theoretical point of view. Although obtained through a similar process, the "independence coupling" is straightforwardly linked to classical empirical experiments. $\pi^{+}$ does not share this simplicity, and interpreting it per se is clearly a domain which deserves to be investigated. We present here an attempt to help the reader form an accurate picture of the "indetermination" concept.

Interest in this coupling is reinforced by its link with Condorcet's majority equilibrium and by its presence in several statistical criteria, as shown in section 5, devoted to graph clustering. Defining a "for" vs "against" notion will lead us to a formal equality interpreting "indetermination" in another space. In fact, we are faced with the famous "Condorcet voting equilibrium", which amounts to exhibiting the situation where the number of opinions "for" exactly balances the number of opinions "against".

In that case, we describe an equilibrium situation, verified from a probabilistic or statistical standpoint, characterizing any measure coupling two margins through "indetermination". The demonstration of this property requires the use of "Mathematical Relational Analysis" notations, which will be formally defined hereafter. We do not intend, in the context of this article, to develop an exhaustive overview of this theory and its applications; we only pick up some results connected with the goals we want to achieve, most of them being extracted from the following list of papers, which gathers some of the most important key features on the subject: [18], [14], [19], [22], [15], [16], [1], [2].

We interpret the equilibrium between the "yes" (agreements) and the "no" (disagreements), or "for" and "against", as in an election: a voting "indetermination situation". Since the number of votes "for" equals the number of votes "against", we are in a situation where it is impossible to take a decision. The term "indetermination" ("indeterminacy" or "uncertainty" could have been used as well) is a formal translation of this surprising situation (fortunately occurring rarely). First of all, let us properly introduce the Relational Analysis notations that we shall use later on.

Definition 6 (Relational Analysis notations)
Let $(u_1, \dots, u_n)$ and $(v_1, \dots, v_n)$ be two $n$-samples of $U \sim \mu$ and $V \sim \nu$ respectively. We define two associated symmetric $n \times n$ binary matrices $X$ and $Y$ by:
$$X_{i,j} = \mathbb{1}_{u_i = u_j}, \quad Y_{i,j} = \mathbb{1}_{v_i = v_j}, \quad \forall\, 1 \le i, j \le n$$

To understand the notation, let us begin with some remarks about Definition 6. Basically, the two binary matrices $X$ and $Y$ (which correspond in fact to two binary equivalence relations based on the drawn modalities) represent agreements and disagreements of the two variables on a same draw of size $n$; they are symmetric, with 1 values on their diagonal. This relational coding has a lot of powerful properties, which will not be presented in this paper but can be found in the articles we mentioned beforehand.

Definition 6 immediately provides us with an algorithm to transfer contingency representations into relational ones. The way back consists in noticing that $X_{i,j} = 1$ if and only if $i$ and $j$ share the same modality of $U \sim \mu$. Hence we assign a modality to each class defined by the equivalence relation embedded in $X$: the only loss of information during this process resides in the names of the modalities.

Now, we are ready to present the theorem justifying the name "indetermination":

Theorem 5 ($\pi^{+}$ and Condorcet equilibrium)
Let $\pi$ be a cross probability law on a set of $p \times q$ categorical variables; $\pi$ is an "indetermination coupling" of its margins if and only if the expected weighted number of "agreements" equals the expected weighted number of "disagreements" on two independent drawings of $\pi$.

Proof
Let $\pi$ be a probability law on $p \times q$ categorical variables, defined through its values $\pi_{u,v}$, $1 \le u \le p$ and $1 \le v \le q$, and let $U$ and $V$ be random variables representing its margins. By $n$ drawings through $\pi$, hence $n$ samplings of $(U, V)$, $U$ and $V$ generate two partitions (equivalence relations) of the $n$ individuals based on their modalities.

We say that an agreement occurs when both partitions simultaneously gather or separate the individuals $i$ and $j$. A disagreement occurs, on the contrary, when one classification regroups $i$ and $j$ while the other one separates them. Formally, if $X, Y$ encode the $n$ samplings as defined in Definition 6:
– $X_{i,j} Y_{i,j} = 1$: agreement of type 11; there are $pq$ couples of classes through which two individuals $i$ and $j$ can realize this type of agreement
– $\bar{X}_{i,j} \bar{Y}_{i,j} = 1$: agreement of type 00; there are $p(p-1)q(q-1)$ couples of classes of this type
– $X_{i,j} \bar{Y}_{i,j} = 1$: disagreement of type 10; there are $pq(q-1)$ couples of classes of this type
– $\bar{X}_{i,j} Y_{i,j} = 1$: disagreement of type 01; there are $p(p-1)q$ couples of classes of this type

As the counts vary according to the types of agreement or disagreement, we propose the following equality, which establishes that the weighted number of agreements equals the weighted number of disagreements:
$$\frac{XY}{pq} + \frac{\bar{X}\bar{Y}}{p(p-1)q(q-1)} = \frac{X\bar{Y}}{pq(q-1)} + \frac{\bar{X}Y}{p(p-1)q} \quad (9)$$

Equality (9) is intrinsically important and appears notably in some of the articles cited beforehand. It is defined on a draw of size $n$ and linked to a contingency indetermination. We take two draws at random, independently under $\pi$, namely $(u_i, v_i)$ and $(u_j, v_j)$, and introduce the probabilistic counterpart of (9) based on these two draws:
$$\frac{\mathbb{E}_{\pi \otimes \pi}(X_{i,j} Y_{i,j})}{pq} + \frac{\mathbb{E}_{\pi \otimes \pi}(\bar{X}_{i,j} \bar{Y}_{i,j})}{p(p-1)q(q-1)} = \frac{\mathbb{E}_{\pi \otimes \pi}(X_{i,j} \bar{Y}_{i,j})}{pq(q-1)} + \frac{\mathbb{E}_{\pi \otimes \pi}(\bar{X}_{i,j} Y_{i,j})}{p(p-1)q} \quad (10)$$

We shall now show that equality (10) occurs precisely when $\pi$ equals the indetermination coupling of its margins, with the formula introduced in Theorem 2. Let us compute the expectations over two independent draws under $\pi$:
– $\mathbb{E}_{\pi \otimes \pi}(X_{i,j} Y_{i,j}) = \sum_{u_i, v_i} \sum_{u_j, v_j} \pi_{u_i,v_i} \pi_{u_j,v_j} \mathbb{1}_{u_i = u_j \,\&\, v_i = v_j} = \sum_{u,v} \pi_{u,v}^2$
– $\mathbb{E}_{\pi \otimes \pi}(\bar{X}_{i,j} \bar{Y}_{i,j}) = \sum_{u,v} \pi_{u,v} (1 - \pi_{u,\cdot} - \pi_{\cdot,v} + \pi_{u,v})$
– $\mathbb{E}_{\pi \otimes \pi}(X_{i,j} \bar{Y}_{i,j}) = \sum_{u,v} \pi_{u,v} (\pi_{u,\cdot} - \pi_{u,v})$
– $\mathbb{E}_{\pi \otimes \pi}(\bar{X}_{i,j} Y_{i,j}) = \sum_{u,v} \pi_{u,v} (\pi_{\cdot,v} - \pi_{u,v})$

Inserting into equation (10), we get:
$$\frac{\sum_{u,v} \pi_{u,v}^2}{pq} + \frac{\sum_{u,v} \pi_{u,v}(1 - \pi_{u,\cdot} - \pi_{\cdot,v} + \pi_{u,v})}{p(p-1)q(q-1)} = \frac{\sum_{u,v} \pi_{u,v}(\pi_{u,\cdot} - \pi_{u,v})}{pq(q-1)} + \frac{\sum_{u,v} \pi_{u,v}(\pi_{\cdot,v} - \pi_{u,v})}{p(p-1)q}$$

Reducing to the same denominator, we get:
$$(p-1)(q-1) \sum_{u,v} \pi_{u,v}^2 + \sum_{u,v} \pi_{u,v}(1 - \pi_{u,\cdot} - \pi_{\cdot,v} + \pi_{u,v}) = (p-1) \sum_{u,v} \pi_{u,v}(\pi_{u,\cdot} - \pi_{u,v}) + (q-1) \sum_{u,v} \pi_{u,v}(\pi_{\cdot,v} - \pi_{u,v})$$

Regrouping the similar terms yields:
$$pq \sum_{u,v} \pi_{u,v}^2 - p \sum_{u} \pi_{u,\cdot}^2 - q \sum_{v} \pi_{\cdot,v}^2 + 1 = 0$$

Making use of a classical expansion similar to equation (3), we obtain:
$$pq \sum_{u,v} \left( \pi_{u,v} - \frac{\pi_{u,\cdot}}{q} - \frac{\pi_{\cdot,v}}{p} + \frac{1}{pq} \right)^2 = 0$$

Finally it holds:
$$\pi_{u,v} = \frac{\pi_{u,\cdot}}{q} + \frac{\pi_{\cdot,v}}{p} - \frac{1}{pq}$$

We have proved that $\pi$ equals $\pi^{+}$ if and only if the expected weighted number of agreements equals the expected weighted number of disagreements on a two-sized drawing.

In order to give a concrete example of the notion of "balanced voting" (also called Condorcet's majority voting equilibrium), let us illustrate the concept of "indetermination" in a specific and interpretable case: criminal judgements in a judicial court. Suppose we have two variables $U, V$. The first one, $U$, represents the result of the judgement (with two possible modalities: condemnation (modality 1) or release (modality 0)), while the second, $V$, represents the court case status (with two modalities as well: guilty (modality 1), innocent (modality 0)). Also, we have a distribution $\mu$ on the first variable and $\nu$ on the second. Associating a "moral index marker" to each case is pretty easy:
– 00: release of an innocent, counted as an agreement (good decision)
– 01: release of a guilty, counted as a disagreement (bad decision)
– 10: condemnation of an innocent, counted as a disagreement (bad decision)
– 11: condemnation of a guilty, counted as an agreement (good decision)

Optimizing one type of "against" votes always occurs at the expense of the other type; a tolerance level between 01 and 10 is set depending on the society's rules. Whatever the preferred "against" type (01 or 10), any society will try to decrease as much as possible the total number of "controversial decisions". Hence the worst court situation would be to have exactly the same number of votes "against" and "for"; indeed, once that equality is passed, reversing all judgements would improve efficiency. This particular criminal judgement "indetermination" situation occurs when agreements equal disagreements, which corresponds to the following equilibrium:
$$\text{cases } 00 + \text{cases } 11 = \text{cases } 10 + \text{cases } 01$$
i.e., expressed in probability:
$$\pi_{0,0} + \pi_{1,1} = \pi_{1,0} + \pi_{0,1}$$

Using the previously introduced equivalence of Theorem 3 (here in a $2 \times 2$ case), this equilibrium characterizes an indetermination coupling of the margins $\mu$ and $\nu$.
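As a numerical illustration of Theorem 5 (our own sketch, with arbitrary margins), one can build $\pi^{+}$ and check equality (10) directly:

```python
# Sketch: check the Condorcet equilibrium (10) on an indetermination coupling.
import numpy as np

mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.4, 0.3, 0.3])
p, q = len(mu), len(nu)
pi = mu[:, None] / q + nu[None, :] / p - 1 / (p * q)    # pi^+ coupling

pi_u = pi.sum(axis=1)[:, None]     # row margins    pi_{u,.}
pi_v = pi.sum(axis=0)[None, :]     # column margins pi_{.,v}

e_11 = (pi ** 2).sum()                                   # E[X Y]
e_00 = (pi * (1 - pi_u - pi_v + pi)).sum()               # E[(1-X)(1-Y)]
e_10 = (pi * (pi_u - pi)).sum()                          # E[X (1-Y)]
e_01 = (pi * (pi_v - pi)).sum()                          # E[(1-X) Y]

agreements = e_11 / (p * q) + e_00 / (p * (p - 1) * q * (q - 1))
disagreements = e_10 / (p * q * (q - 1)) + e_01 / (p * (p - 1) * q)
print(np.isclose(agreements, disagreements))             # True for pi^+
```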
5 Application to graph clustering

We now apply the two couplings to graph clustering, starting with some definitions.

Definition 7 (Weighted graph)
A weighted graph $G$ is a graph which contains $n$ vertices $1 \le i \le n$, connected to each other through edges $(i,j)$ carrying weights $a_{i,j}$ (a weighted incidence matrix). We also introduce the total weight $M$, given by $2M = \sum_{i,j} a_{i,j}$.

A basic way to randomly generate a graph is to use the Erdős–Rényi distribution:
Definition 8 (Erdős–Rényi)
Fixing a number $n$ of vertices and $\epsilon \in [0, 1]$, we link any set of two vertices by independently drawing through a Bernoulli law with parameter $\epsilon$, leading to a 0-1 weight. The obtained graph is undirected and each weight is 0 or 1.

Remark 7
Adding a parameter $p$ representing the maximum weight, we can easily create a weighted graph by drawing from a Binomial law with parameters $(\epsilon, p)$, while linking couples (instead of sets) generates oriented graphs.
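A minimal generator following Definition 8 and Remark 7 (function and parameter names are ours):

```python
# Sketch: Erdős–Rényi graphs, plain (Bernoulli) and weighted (Binomial).
import numpy as np

def erdos_renyi(n, eps, max_weight=1, rng=None):
    """Symmetric weighted incidence matrix with Binomial(max_weight, eps) weights."""
    rng = rng or np.random.default_rng()
    draws = rng.binomial(max_weight, eps, size=(n, n))
    a = np.triu(draws, k=1)          # draw one weight per set {i, j}
    return a + a.T                   # symmetrize: undirected graph

a = erdos_renyi(n=100, eps=0.3)
M = a.sum() / 2                      # total weight, with 2M = sum of all a_ij
```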
As mentioned in section 1, our work is devoted to the search for classes, groupings, clusters or cliques (whatever we call them) within a graph. They are defined through an equivalence relation, as specified in Definition 9:

Definition 9 (Graph clustering)
Let us call $x$ a matrix representation of a binary equivalence relation, the result of the clustering of a graph $G$. Then $x_{i,j}$ equals 0 or 1, and equals 1 if and only if the two vertices $i$ and $j$ are in the same class for $x$, and 0 if not.

Clustering algorithms aim at providing classes maximizing internal similarities as well as minimizing external ones. A first option is to take as input the number $K$ of classes we are looking for, together with an associated distance (or dissimilarity index), and come up with a list of best representatives or "means" for each class. The output "means" tend to optimize the sum of distances from all vertices to their nearest mean. The K-means algorithm, whose idea goes back to the fifties (see [24]), typically illustrates this option (factually, the method of S. Lloyd (1957), rewritten by E. W. Forgy (1965), corresponds to the oldest version of the K-means really used). Having fixed a distance and a number of classes, finding optimal means minimizing the sum of the distances remains an NP-hard problem. A second option is to construct a local criterion $c$ which assigns a weight $c_{i,j}$ to each couple $(i,j)$ of vertices based on their similarity; the more similar they are, the higher the criterion. We then build up a global criterion by summing up the local values $c_{i,j}$ if and only if $i$ and $j$ are in the same class, as proposed in Problem 6.

Problem 6 (Generic clustering problem)
$$\max_{x} M(c, x) = \sum_{i=1}^{n} \sum_{j=1}^{n} c_{i,j}\, x_{i,j}$$
subject to: $x$ is an equivalence relation.

First, let us remark that, as notably spotted in [15], [18], [22], the equivalence relation constraint can be written as:
• $x_{i,i} = 1, \ \forall\, 1 \le i \le n$ (reflexivity)
• $x_{i,j} = x_{j,i}, \ \forall\, 1 \le i, j \le n$ (symmetry)
• $x_{i,j} + x_{j,k} - x_{i,k} \le 1, \ \forall\, 1 \le i, j, k \le n$ (transitivity)

Thanks to the linearity of these constraints, in addition to the linear expression of the criterion itself (in terms of the unknown $x_{i,j}$ values), Problem 6, although a priori NP-hard, can be exactly solved (under some conditions) through the integer relaxation of a good existing 0-1 linear programming code (see [18]) for problem sizes $n$ lower than, say, 300. But in the context of network and graph clustering, the size $n$ of the problem (here the number of vertices or nodes) can be really huge (millions for social networks), and direct solving by linear programming, even specially tuned, is no longer possible; therefore, the use of robust heuristics becomes mandatory.

The "Louvain" algorithm (see [10] or [21]) is adequately considered as one of these good and available heuristics, allowing to cope with this clustering task. It relies on two phases for globally maximizing the criterion $M(c,x)$ based on the local cost values $c_{i,j}$.
0. Initially, each node in the network is assigned to its own community: there are as many communities as vertices.
1. In the first phase, for each node $i$, the change resulting from removing it from its community and adding it to each of its neighbors' communities is computed. If $M(c,x)$ increases for some of them, $i$ is put in the locally optimal connected community. This process is applied repeatedly and sequentially to all nodes until no improvement of $M(c,x)$ occurs. Once this local maximum is reached, the first phase has ended.
2. In the second phase, the algorithm groups all the nodes of a same community and builds a new network whose nodes are the communities of the previous phase.
Links between nodes of the same community are now represented by self-loops on the new community node, and links from multiple nodes of the same community to nodes of another community are represented by weighted edges between communities.
3. Once the new network is created, the second phase is completed and the first phase can be re-applied to the new network.
4. The algorithm eventually ends when the improvement in $M(c,x)$ brought by the first phase is less than a chosen threshold.

As mentioned beforehand, the Louvain algorithm is a good heuristic; it does not systematically provide an exact optimal result, but a quite good approximate one. For a rough comparison, the K-means algorithm also produces an approximate solution, but with a supplementary drawback: it imposes fixing a priori the number $K$ of classes we want, which is completely out of reach when dealing with social networks or huge graphs (guessing a reasonable value of $K$ is then impossible, or extremely greedy in computing time). In addition, K-means, like the Louvain algorithm, depends on the vertex naming, as both browse the vertices lexicographically and sequentially.

Whatever the costs $c_{i,j}$ are, an optimal solution of the global criterion $M(c,x)$ exists; even if we are unable to find the optimum, the generic Louvain algorithm obtains approximate solutions which are quite satisfactory, often sufficient for practical purposes, and for most of them close to optimality. The optimality and unicity of those solutions $x_{i,j}$ have been studied in many articles and books, and it is not our intention to discuss this point further in this paper. We concentrate on another aspect: the choice between two canonic costs, in the light of the previous sections.
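The first phase of such a heuristic is easy to sketch with a pluggable local cost matrix $c_{i,j}$; the following simplified, non-optimized sketch (ours, not the reference implementation of [4]) moves nodes greedily while the global criterion $M(c,x) = \sum_{i,j} c_{i,j} x_{i,j}$ increases:

```python
# Sketch: greedy first phase of a Louvain-style heuristic for Problem 6.
# c is any symmetric local-cost matrix (e.g. m_x or m_plus, defined below).
import numpy as np

def greedy_phase(c, max_sweeps=50):
    n = c.shape[0]
    labels = np.arange(n)                    # step 0: one community per node
    for _ in range(max_sweeps):
        moved = False
        for i in range(n):
            others = np.arange(n) != i
            # Attachment of i to community k: sum of c[i, j] over j in k, j != i;
            # moving i changes M(c, x) by twice the attachment difference.
            gains = {}
            for k in np.unique(labels):
                members = (labels == k) & others
                gains[k] = 2 * c[i, members].sum()
            best = max(gains, key=gains.get)
            if gains[best] > gains[labels[i]]:
                labels[i] = best             # locally optimal move
                moved = True
        if not moved:
            break                            # local maximum of M(c, x)
    return labels
```

A full implementation would add the second, graph-coarsening phase; the sketch only illustrates that the local cost $c_{i,j}$ is the sole place where the clustering criterion enters.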
The original, famous and well-known Newman–Girvan presentation of a global criterion for graph clustering (see [10] or [21]) has been introduced in the Louvain algorithm together with a global cost called "modularity", defined by:

Definition 10 (Modularity)
Given a partition $x_{i,j}$ and a graph $G$ with weight function $a$ on its edges, the global modularity reads:
$$M^{\times}(G, x) = \frac{1}{2M} \sum_{i,j} \left[ a_{i,j} - \frac{a_{i,\cdot}\, a_{\cdot,j}}{2M} \right] x_{i,j} \quad (11)$$

Let us first remark that the original modularity $M^{\times}$ is nothing but our generic global cost function defined through Problem 6 with:
$$c_{i,j} = m^{\times}(G)_{i,j} = a_{i,j} - \frac{a_{i,\cdot}\, a_{\cdot,j}}{2M}$$
(up to the constant factor $\frac{1}{2M}$, which does not affect the optimization), and that the local gain $m^{\times}(G)_{i,j}$ of putting two vertices in the same class is the local deviation from independence. Indeed, using Definition 7, we know that $\pi_{i,j} = \frac{a_{i,j}}{2M}$ can be seen as a probability measure on $\{1, \dots, n\}^2$ with margins $\mu_i = \frac{a_{i,\cdot}}{2M}$, so that $m^{\times}$ rewrites:
$$m^{\times}(G)_{i,j} = 2M \left( \pi_{i,j} - \mu_i \mu_j \right)$$
and does express itself as a canonic deviation-from-independence criterion.

A second remark: as the expression of $m^{\times}(G)_{i,j}$ contains no absolute value or square, non-connected vertices lead to negative weights, preventing them from being allocated to the same class. If they are connected, the importance of $m^{\times}(G)_{i,j}$ evolves positively as $i$ and $j$ have fewer connections ($a_{i,\cdot}$ and $a_{\cdot,j}$ small); here again this implies an appropriate behavior. More precisely, since independence ensures a coupling as uniform as possible with fixed margins (it is the solution of Problem 4), $m^{\times}$ appears as a fair construction. The criterion basically measures a distance between the observed linkage weight and an expected flat weight given by the average neighborhood.

Problem 6 basically represents an extension of the already introduced modularity criterion towards a generic criterion based on a local input. We suggest an expression $m^{+}(G)_{i,j}$ which represents a deviation from indetermination. It will be used as a local cost function in Problem 6, leading to a slightly different global formula $M^{+}(G,x)$ to optimize:
$$m^{+}(G)_{i,j} = a_{i,j} - \frac{a_{i,\cdot}}{n} - \frac{a_{\cdot,j}}{n} + \frac{2M}{n^2}$$

Symmetrically to $m^{\times}$, it ends up being a canonic deviation-from-indetermination criterion. Indeed, with $\pi_{i,j} = \frac{a_{i,j}}{2M}$, $m^{+}$ rewrites:
$$m^{+}(G)_{i,j} = 2M \left( \pi_{i,j} - \frac{\mu_i}{n} - \frac{\mu_j}{n} + \frac{1}{n^2} \right)$$
The global criterion being:
$$M^{+}(G, x) = \sum_{i,j} \left[ a_{i,j} - \frac{a_{i,\cdot}}{n} - \frac{a_{\cdot,j}}{n} + \frac{2M}{n^2} \right] x_{i,j} \quad (12)$$

We have seen in section 2.4 that the squared difference between both couplings tends to be small. Moreover, they share a lot of properties, as shown in sections 3 and 4. In the same way, Patricia Conde-Céspedes noticed that many statistical criteria (at least the most frequently used) measuring the correlation of variables are based either on a "distance to independence" or are straightforwardly related to a "distance to indetermination" (an interesting list is given in [7]). According to these remarks, our canonic deviation-from-indetermination criterion $M^{+}$ deserves the same types of use as those dedicated to the Newman–Girvan $M^{\times}$.

5.2 Erdős–Rényi Experimental Tests

As already mentioned, solving Problem 6 is NP-hard, so we can expect precise results neither about the number of classes for a given criterion, nor about the running time of the Louvain algorithm on a given graph. Nevertheless, as the method is based on optimizing a local criterion, we can compare the local values directly to extrapolate a common or a distinct global behavior. We propose a comparison based on Erdős–Rényi graphs to spot differences or similarities between the $m^{\times}(G)_{i,j}$ and $m^{+}(G)_{i,j}$ values. The aim is to observe the distribution of both criteria on a typical graph.
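Both local costs are direct to compute from the weighted incidence matrix; a short sketch (ours) used as a building block below:

```python
# Sketch: local costs m_x (deviation from independence) and m_plus
# (deviation from indetermination) for a weighted incidence matrix a.
import numpy as np

def local_costs(a):
    n = a.shape[0]
    two_m = a.sum()                       # 2M = sum of all weights
    row = a.sum(axis=1)[:, None]          # a_{i,.}
    col = a.sum(axis=0)[None, :]          # a_{.,j}
    m_x = a - row * col / two_m           # local part of eq. (11)
    m_plus = a - row / n - col / n + two_m / n ** 2   # local part of eq. (12)
    return m_x, m_plus
```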
First, to simplify the observations, and since only the reference cost varies between $m^{+}$ and $m^{\times}$, we keep only that part by subtracting $a_{i,j}$; it is formally defined in Definition 11. Then, we generate graphs randomly, compute each criterion on random pairs of vertices and store the reference cost. The experiment is formally specified in Algorithm 1, while the results are gathered in Figure 4.

Definition 11 (Bias or reference cost)
The two biases derived from $m^{\times}$ and $m^{+}$ are respectively:
– $b^{\times}_{i,j} = \frac{a_{i,\cdot}\, a_{\cdot,j}}{2M}$
– $b^{+}_{i,j} = \frac{a_{i,\cdot}}{n} + \frac{a_{\cdot,j}}{n} - \frac{2M}{n^2}$

On Figure 4 we observe that both distributions are similar for all values of $\epsilon$. Indeed, the curves are identical on their core values (those with a number of realizations above 200). This is not really surprising, because they both come from an optimization of a transport problem aiming at flattening the distribution (section 2), and they tend to be equal (section 2.4).
Algorithm 1 Provides the distribution of the two reference costs

Input: $n$, $\epsilon$
$L^{+} \leftarrow [\,]$; $L^{\times} \leftarrow [\,]$
for $R = 1 \dots$ do
    $G \leftarrow$ Erdős–Rényi$(n, \epsilon)$
    $(i, j) \leftarrow (\mathrm{RandomUnif}(n), \mathrm{RandomUnif}(n))$
    $L^{+} \leftarrow L^{+} + (b^{+}_{i,j}(G))$
    $L^{\times} \leftarrow L^{\times} + (b^{\times}_{i,j}(G))$
end for
return $(L^{+}, L^{\times})$
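A runnable, self-contained transcription of Algorithm 1 (the number of repetitions, left unspecified above, is an arbitrary choice of ours):

```python
# Sketch: empirical distributions of the reference costs b_plus and b_x
# on Erdős–Rényi graphs, following Algorithm 1.
import numpy as np

def reference_costs(n, eps, runs=10_000, rng=None):
    rng = rng or np.random.default_rng()
    l_plus, l_x = [], []
    for _ in range(runs):
        upper = np.triu(rng.binomial(1, eps, size=(n, n)), k=1)
        a = upper + upper.T                 # Erdős–Rényi incidence matrix
        i, j = rng.integers(n), rng.integers(n)
        two_m = a.sum()
        l_x.append(a[i].sum() * a[:, j].sum() / two_m)
        l_plus.append(a[i].sum() / n + a[:, j].sum() / n - two_m / n ** 2)
    return np.array(l_plus), np.array(l_x)

l_plus, l_x = reference_costs(n=50, eps=0.3)
print(l_plus.mean(), l_x.mean())   # both means are close to eps (see Figure 4)
```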
Fig. 4 Empirical distribution of the two reference costs $b^{+}_{i,j}$ and $b^{\times}_{i,j}$ for $\epsilon$ in [0.3, 0.6, 0.9]; the X-axis gives the values of the bias, the Y-axis the corresponding number of realizations

We also notice on Figure 4 that their common mean is equal to the value of $\epsilon$, as can easily be derived from the formulas. A difference nevertheless remains: the bias $b^{+}$ has smaller extreme left-side values, while the bias $b^{\times}$ has higher extreme right-side values. To analyze this, note that $m^{\times}(G)_{i,j}$, as well as $m^{+}(G)_{i,j}$, only depends on the values of $a_{i,j}$, $a_{i,\cdot}$ and $a_{\cdot,j}$; moreover, it is easy to get the corresponding probability of each event, as expressed in Proposition 3.

Proposition 3 (Probability values)
Let b be a binary value, b ≤ n_i ≤ n and b ≤ n_j ≤ n; let us compute the following probability:

P(a_{i,j} = b,\ a_{i,\cdot} = n_i,\ a_{\cdot,j} = n_j) = \varepsilon^{b} (1-\varepsilon)^{1-b} \binom{n-1}{n_i-b} \varepsilon^{n_i-b} (1-\varepsilon)^{n-1-n_i+b} \binom{n-1}{n_j-b} \varepsilon^{n_j-b} (1-\varepsilon)^{n-1-n_j+b}

The corresponding values m+_{i,j} and m×_{i,j} associated with a triple (b, n_i, n_j) of parameters being evident, we propose figure 5, which represents the difference between the theoretical distributions of both criteria with ε = 0.
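Proposition 3 makes the theoretical distribution of figure 5 directly computable; a short sketch (the helper names are ours, math.comb supplies the binomial coefficients) reads:

```python
from math import comb

def prob_event(b, n_i, n_j, n, eps):
    """Probability of proposition 3: in G(n, eps), the edge indicator
    a_ij equals b (0 or 1) while the margins are a_i. = n_i and a_.j = n_j."""
    def binom_pmf(k, m, p):
        return comb(m, k) * p ** k * (1 - p) ** (m - k)  # P(Binomial(m, p) = k)
    # One Bernoulli(eps) factor for the (i, j) edge itself; following the
    # proposition, the rest of each margin is a Binomial(n - 1, eps) draw.
    return (eps ** b * (1 - eps) ** (1 - b)
            * binom_pmf(n_i - b, n - 1, eps)
            * binom_pmf(n_j - b, n - 1, eps))
```

Weighting each admissible triple (b, n_i, n_j) by the corresponding value of m×_{i,j} − m+_{i,j} then gives the theoretical distribution plotted in figure 5.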
Fig. 5 Theoretical distribution of the difference m×(G)_{i,j} − m+(G)_{i,j} (same as b+_{i,j} − b×_{i,j}) on generated graphs

b× and b+ have distinct forms, but their proximity on highly probable values, shown on figure 5, illustrates section 2.4: if we couple two variables with margins of size n, the expected difference is small, of order 1/n. Extreme values, on the contrary, may differ drastically. Though this seems the opposite of figure 4, since m+ comes with higher values than m×, it is consistent because of the minus sign in the formula linking m with b.

Having noticed that b+ and b× differ on their extreme values, we compute them on a general Erdős–Rényi graph (respecting the common value 2M = n²ε) and obtain the bounds:

−ε ≤ b+ ≤ n/n + n/n − ε = 2 − ε
0 ≤ b× ≤ (n × n)/(n²ε) = 1/ε

As already expected from figure 5, the difference between the extreme values is arbitrarily high.

5.3 Summary of an application to various graphs

The similar distributions found in section 5.2 must be confirmed through real-life applications. We gather in table 1 the number of classes found by Patricia Conde-Cespedes, who applied both criteria to the same empirical graphs. She obtained similar results, as we expected beforehand. We present here the list of graphs she used:

– The social network known as the "Zachary karate club" is frequently used in social network analysis and is composed of 34 members of a karate club of an American university (see [30]).
– The social network "American College Football" gathers American football matches during year 2000. Each vertex is a team and connections represent a match (see [10]).
– The "Jazz" social network represents collaborations between jazz musicians during the years 1912 to 1940. Each vertex is a band, and two bands are connected if they share a musician. Data were extracted from The Red Hot Jazz Archive (see [11]).
– "Internet" is a sub-graph of the Internet (see [12]).
– "Amazon", built from Amazon.com, contains vertices representing products, which are connected if they are frequently bought together (see [29]).
– "YouTube", where each vertex is a user. On YouTube, users can create groups; two users are connected in the graph if they joined the same group (see [20]).
Table 1 Number of classes found by each criterion on various graphs

                           Karate   Football   Jazz    Internet   Amazon    YouTube
N (nb vertices)            34       115        198     69 949     334 863   1 134 890
M (sum of weights)         78       613        2 742   351 280    925 872   2 987 624
Number of classes for M×   –        –          –       46         –         –
Number of classes for M+   –        –          –       39         –         –

Table 1 can be read as follows: for example, the "Internet" graph contains 69,949 vertices (nodes) with 351,280
edges (links); if we apply the Louvain algorithm to it with the global criterion M×, we usually find 46 communities, while M+ leads to 39.

As anticipated in section 2.4, the criteria are (on average) very close; consequently their resulting effect on various graphs is similar. Section 5.2 of the present paper provides the reader with an explanation of the behavior Patricia Conde-Cespedes observed experimentally in [7].

5.4 A general remark to differentiate the two criteria

While section 5.2 concludes on a globally similar behavior of both criteria, reinforced by Patricia Conde-Cespedes' experimental results summarized in section 5.3, this does not prevent them from being quite different on specific graphs.

Scanning the local biases introduced in definition 11, we notice that the product form b× will be small except if the mass a_{i,·} of vertex i AND the mass a_{·,j} of vertex j are both high; the additive form b+, on the contrary, will be small unless one of the two masses a_{i,·} OR a_{·,j} is big. Remembering that m has to be high to lead to a merging:

– m× is penalized (by b×) only if a_{i,·} AND a_{·,j} are big;
– m+ is penalized (by b+) as soon as a_{i,·} OR a_{·,j} is big.

To summarize: to maximize the additive form we cannot allow either of the two neighborhoods to carry a large mass, while the product form may accept one. Leveraging that remark, we can build specific graphs that differentiate the two criteria. Eventually, this enables us to exhibit very specific graphs left intact by one criterion while merged into one class by the other; we can even produce unconnected vertices regrouped in the same class because of the overall weight distribution. The interested reader can refer to [3] for further details.

5.5 A common threshold on a particular form of graph

In this section we present a curiosity: a form of graph on which both criteria share the same merging threshold. Beyond the curiosity itself, it presents several interests: first, we are able to fix a threshold, and secondly, it is a training exercise for a more general analysis.

We propose to work on a loop of n classes like the one in figure 6 (for which n = 10) and to look for a threshold on a_{ni,i} (the unique parameter) for the graph to be left intact by M×. As every vertex has the same environment, we may select any b_{i,j}: they are all equal. Counting edges gives 2M = n × a_{i,·} with a_{i,·} = 2 + a_{ni,i}, so the no-merging requirement b×_{i,j} = b× = a_{i,·}²/2M ≥ 1 amounts, after solving a quadratic inequality (with u = a_{ni,i}):

u² + (4 − n) u + 2(2 − n) ≥ 0

to a_{ni,i} ≥ n − 2. For instance, in our example, n = 10, so that 8 is the threshold explaining why figure 6 is a convenient final graph for M×.

If we look at the behavior of M+ on that very graph, we observe that our merging-threshold equation is:

b+_{i,j} ≥ 1, i.e. 2 a_{i,·}/n − 2M/n² ≥ 1, i.e. 2 (a_{ni,i} + 2)/n − n (a_{ni,i} + 2)/n² ≥ 1

which again gives a_{ni,i} ≥ n − 2. Whereas we only required M× to have a threshold, we notice that not only does M+ also have one, but both thresholds are equal. The parallel properties of the two coupling functions appear here in a graph application.
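A quick numerical sanity check of this common threshold, under our reading of the construction (each class collapsed to one node carrying its internal weight a_{ni,i} plus two unit loop edges), shows that both biases coincide on this graph and cross the merging value 1 exactly at a_{ni,i} = n − 2:

```python
def loop_biases(n, u):
    """Biases b_times and b_plus on the loop of n classes of figure 6,
    where every class carries an internal weight u = a_{ni,i} and two
    unit edges to its neighbours."""
    margin = 2 + u                            # common margin a_{i,.}
    two_m = n * margin                        # 2M = n * a_{i,.}
    b_times = margin * margin / two_m         # simplifies to (2 + u) / n
    b_plus = 2 * margin / n - two_m / n ** 2  # also simplifies to (2 + u) / n
    return b_times, b_plus

# n = 10: the threshold is expected at u = n - 2 = 8.
for u in (7, 8, 9):
    print(u, loop_biases(10, u))   # -> (0.9, 0.9), (1.0, 1.0), (1.1, 1.1)
```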
Fig. 6 Example of a convenient final graph clustering for M×

Remark 8 (Curiosity?)
As mentioned in remark 2.4, the two coupling functions are equal when one of the margins is uniform. Given the form of the graph in figure 6, all vertices are symmetric to one another, so that any distribution based on the neighborhood is uniform: this explains the result.
6 Conclusion

First, we followed the historical line and introduced two basic notions extracted from Discrete Optimal Transport Theory: independence and indetermination. As recalled, the first one is the most intuitive, frequently used in mathematical articles as well as in real-life experiments. The second notion appears more surprising: poorly studied in the statistical literature, but more commonly used by people working on Mathematical Relational Analysis, Voting Theory and Analysis of Variance. Together, they cover the only two canonic projection costs, as quoted in section 2.5.

To illustrate the usefulness of the parallel construction, we turned to applications and completed the track introduced by Patricia Conde-Cespedes in her thesis [7]. She gathered a list of graph clustering criteria and classified them according to their deviation to one of the two previously mentioned coupling functions.
Section 5 reports a further analysis of the two canonical criteria. It gathers results about the general similarity of their behavior on various graphs, as well as about the extreme values that set them apart. Subsection 2.4, notably, shows that they differ only slightly, which explains the experimental results.

In each section, from optimal transport to graph theory, we insisted on the parallel between both notions together with their differences. As quoted beforehand, they appear as the two unique canonic structural solutions. A particularly curious situation is their duality when we pass from contingency to relational notations; it was first spotted in [17] and needs to be further understood. Generally, the differences between them need to be examined more closely, either to coin a macro criterion, or to choose wisely between one and the other depending on the structure of the graph. In any case, the traditional use of independence at the expense of indetermination needs to be further motivated and explained.
References
1. Ah-Pine, J.: Sur des aspects algébriques et combinatoires de l'analyse relationnelle: applications en classification automatique, en théorie du choix social et en théorie des tresses. Ph.D. thesis, Paris 6 (2007)
2. Ah-Pine, J.: On aggregating binary relations using 0-1 integer linear programming. Workshop ISAIM (2009)
3. Bertrand, P.: Transport optimal, matrices de Monge et pont relationnel. Ph.D. thesis, Paris 6 (to be defended 2021)
4. Blondel, V., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, P10008 (2008)
5. Burkard, R.E., Klinz, B., Rudolf, R.: Perspectives of Monge properties in optimization. Discrete Applied Mathematics, 95–161 (1996)
6. Campigotto, R., Conde-Cespedes, P., Guillaume, J.L.: A generalized and adaptive method for community detection. ArXiv (2013)
7. Conde-Cespedes, P.: Modélisations et extensions du formalisme de l'analyse relationnelle mathématique à la modularisation des grands graphes. Ph.D. thesis, Paris 6 (2013)
8. Csiszár, I., et al.: Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. The Annals of Statistics (4), 2032–2066 (1991)
9. Fréchet, M.: Sur les tableaux de corrélations dont les marges sont données. Annales de l'Université de Lyon, Section A, 53–77 (1951)
10. Girvan, M., Newman, M.E.J.: Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the United States of America, 7821–7826 (2002)
11. Gleiser, P.M., Danon, L.: Community structure in jazz. Advances in Complex Systems (ACS), 565–573 (2003)
12. Hoerdt, M., Magoni, D.: Proceedings of the 11th International Conference on Software, Telecommunications and Computer Networks, 257 (2003)
13. Hoffman, A.J.: On simple linear programming problems. Proceedings of the Seventh Symposium in Pure Mathematics of the AMS, 317–327 (1963)
14. Marcotorchino, J.F.: Utilisation des comparaisons par paires en statistique des contingences. Publication du Centre Scientifique IBM de Paris et Cahiers du Séminaire Analyse des Données et Processus Stochastiques, Université Libre de Bruxelles, 1–57 (1984)
15. Marcotorchino, J.F.: Maximal association theory as a tool of research. In: Classification as a Tool of Research, W. Gaul and M. Schader editors, North Holland, Amsterdam (1986)
16. Marcotorchino, J.F.: Seriation problems: an overview. Applied Stochastic Models and Data Analysis, 139–151 (1991)
17. Marcotorchino, J.F., Conde-Cespedes, P.: Optimal transport and minimal trade problem, impacts on relational metrics and applications to large graphs and networks modularity. Geometric Science of Information, 169–179 (2013)
18. Marcotorchino, J.F., Michaud, P.: Optimisation en analyse ordinale des données. Masson, 1–211 (1979)
19. Messatfa, H.: Maximal association for the sum of squares of a contingency table. Revue RAIRO, Recherche Opérationnelle, 29–47 (1990)
20. Mislove, A., Marcon, M., Gummadi, K.P., Druschel, P., Bhattacharjee, B.: Measurement and analysis of online social networks. Proceedings of the 5th ACM/Usenix Internet Measurement Conference (IMC'07) (2007)
21. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks. Physical Review E (2), 026113 (2004)
22. Opitz, O., Paul, H.: Aggregation of ordinal judgements based on Condorcet's majority rule. In: Data Analysis and Decision Support. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg (2005)
23. Sklar, A.: Random variables, joint distribution functions, and copulas. Kybernetika, 449–460 (1973)
24. Steinhaus, H.: Sur la division des corps matériels en parties. Bulletin de l'Académie Polonaise des Sciences 4(12), 801–804 (1957)
25. Stemmelen, E.: Tableaux d'échanges, description et prévision. Cahiers du Bureau Universitaire de Recherche Opérationnelle (1977)
26. Wilson, A.G.: A statistical theory of spatial distribution models. Transportation Research, 253–269 (1967)
27. Wilson, A.G.: The use of entropy maximising models. Journal of Transport Economics and Policy, 108–126 (1969)
28. Wilson, A.G.: Entropy in urban and regional modelling (1970)
29. Yang, J., Leskovec, J.: Defining and evaluating network communities based on ground-truth. International Conference on Data Mining, 745–754 (2012)
30. Zachary, W.W.: An information flow model for conflict and fission in small groups. Journal of Anthropological Research 33 (1977)