Markov random fields factorization with context-specific independences
Alejandro Edera
Dep. de Sistemas de Computación, Universidad Tecnológica Nacional

Facundo Bromberg
Dep. de Sistemas de Computación, Universidad Tecnológica Nacional

Federico Schlüter
Dep. de Sistemas de Computación, Universidad Tecnológica Nacional
Abstract
Markov random fields provide a compact representation of joint probability distributions by representing their independence properties in an undirected graph. The well-known Hammersley-Clifford theorem uses these conditional independences to factorize a Gibbs distribution into a set of factors. However, an important limitation of using a graph to represent independences is that it cannot encode some types of independence relations, such as context-specific independences (CSIs). These are a particular case of conditional independences that hold only for certain assignments of their conditioning set, in contrast to conditional independences, which must hold for all assignments. This work presents a method for factorizing a Markov random field according to the CSIs present in a distribution, and formally guarantees that this factorization is correct. This is presented in our main contribution, the context-specific Hammersley-Clifford theorem, a generalization to CSIs of the Hammersley-Clifford theorem, which applies to conditional independences.
1 Introduction

Markov random fields (MRFs), also known as undirected graphical models or Markov networks, belong to the family of probabilistic graphical models (Koller and Friedman, 2009), a well-known computational framework for the compact representation of joint probability distributions. These models are composed of an independence structure and a set of numerical parameters. The independence structure is an undirected graph that compactly encodes the conditional independences among the variables in the domain. Given the structure, the numerical parameters quantify the relationships in the structure. Probability distributions encountered in practice exhibit important complexity deficiencies: exponential space complexity of their representation, time complexity of inference, and sample complexity when learning them from data. Based on the structure of independences, it is possible to represent the joint probability distribution efficiently by factorizing it into smaller functions (or factors), each over a subset of the domain variables, sometimes resulting in exponential reductions in these complexities. This factorization can be done by using the well-known Hammersley-Clifford theorem (Hammersley and Clifford, 1971).

An important issue of using a graph to represent independences is that it cannot encode some types of independence relations, such as context-specific independences (CSIs) (Boutilier et al., 1996). These independences are similar to conditional independences, except that they hold only for certain assignments of their conditioning set. CSIs have been applied in a wide range of scenarios, achieving significant improvements in time, space, and sample complexities in comparison with approaches that only use the conditional independences encoded by the graph (Chickering et al., 1997; Fierens, 2010; Poole and Zhang, 2003; Wexler and Meek, 2008; Lowd and Davis, 2010; Ravikumar et al., 2010). In these contributions, the CSIs are encoded in alternative data structures (e.g., using a decision tree instead of a graph). This is carried out by assuming that the factors of the distribution are conditional probability distributions. In this sense, the CSIs are not used to factorize the distribution; rather, they are used to represent the factors efficiently.

The main contribution of our work is the context-specific Hammersley-Clifford theorem. The importance of this theoretical result lies in that it allows one to factorize a distribution using CSIs, obtaining a sparser representation than that obtained with conditional independences, with theoretical guarantees. For this, a log-linear model is used as a more fine-grained representation of MRFs (Koller and Friedman, 2009). By using such models it is possible to extend the advantages of the Hammersley-Clifford theorem, that is, improvements in time, space, and sample complexities.

The remainder of this work is organized as follows. The next section provides a summary of the related work in the literature. Section 3 presents an overview of how to factorize a distribution by exploiting its independences. Section 4 formally describes the context-specific Hammersley-Clifford theorem, which factorizes a log-linear model according to a set of CSIs. The paper concludes with a summary in Section 5.
2 Related work

There are several works in the literature (Della Pietra et al., 1997; Lee et al., 2006; Lowd and Davis, 2010; Van Haaren and Davis, 2012) that learn log-linear models directly by presenting different procedures for selecting features from data. None of these works discuss CSIs, nor do they present any guarantee on how the generated log-linear model is related to the underlying distribution.

CSIs were first introduced by Boutilier et al. (1996), who encode them locally within the conditional probability tables (factors) of Bayesian networks as decision trees. Their approach is hybrid, encoding conditional independences in the directed graph and CSIs as decision trees over the variables of a conditional probability table. Their work also presents theoretical results for a sound graphical representation. Our work instead proposes a unified representation of CSIs and conditional independences in a log-linear model. As such, it first requires theoretical guarantees on how a distribution factorizes according to this model (not needed in the work of Boutilier, as the factorization into conditional probability tables is not affected by the CSIs). It remains for future investigation to find an efficient graphical representation (and theoretical guarantees thereon).

The work of Gogate et al. (2010) is the closest to ours, presenting an algorithm for factorizing a log-linear model according to CSIs. For that, it introduces a statistical independence test for eliciting these independences from data. The work assumes the underlying distribution to be a thin junction tree. Although some theoretical results are presented that guarantee an efficient computational performance, no results are presented that guarantee that the proposed factorization is sound.
3 Background

This section provides some background on MRFs, explaining how to factorize a distribution by exploiting its independences. Let us start by introducing some necessary notation. We use capital letters for sets of indexes, reserving the letter $X$ for the domain of a distribution and $V$ for the nodes of a graph. Let $X = (X_a, X_b, \ldots, X_n)$ represent a vector of $n = |X|$ random variables. The function $\mathrm{Val}(X_a)$ returns all the values of the domain of $X_a$, and $\mathrm{Val}(X_U)$ returns all the possible values of the set of variables $X_U = (X_i, i \in U)$. Let $x = (x_a, x_b, \ldots, x_n)$ be a complete assignment of $X$. The values of $X_a$ are denoted by $x_a^j \in \mathrm{Val}(X_a)$, where $j = 1, \ldots, |\mathrm{Val}(X_a)|$. Finally, we denote by $x_{\langle W \rangle}$ the value taken by the variables $X_W$ in the complete assignment $x$.

Conditional independences are regularities of distributions that have been extensively studied in the field of statistics, where it has been demonstrated how they can be effectively and soundly used for reducing the dimensionality of a distribution (Pearl, 1988; Spirtes et al., 2000; Koller and Friedman, 2009). Formally, a conditional independence is defined as follows:
Definition 1 (Conditional independence). Let $X_a, X_b \in X$ be two random variables, and $X_U \subseteq X \setminus \{X_a, X_b\}$ be a set of variables. We say that $X_a$ and $X_b$ are conditionally independent given $X_U$, denoted as $I(X_a, X_b \mid X_U)$, if and only if for all values $x_a \in \mathrm{Val}(X_a)$, $x_b \in \mathrm{Val}(X_b)$, and $x_U \in \mathrm{Val}(X_U)$:
$$p(X_a \mid X_b, X_U) = p(X_a \mid X_U), \qquad (1)$$
whenever $p(X_b, X_U) > 0$.
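To make Equation (1) concrete, the following is a minimal Python sketch that checks a conditional independence by brute force over a small discrete joint distribution stored as a table. The representation (a dict mapping complete assignments to probabilities) and the function name `is_cond_independent` are our own illustration, not something defined in the paper.

```python
from itertools import product

def is_cond_independent(joint, a, b, U, domains, tol=1e-9):
    """Brute-force check of I(X_a, X_b | X_U) as in Definition 1 / Equation (1).

    joint   : dict mapping complete assignments (tuples indexed by variable) to p(x)
    a, b    : indexes of the two variables being tested
    U       : list of indexes of the conditioning set X_U
    domains : dict mapping each variable index to its list of values
    """
    def marg(fixed):
        # marginal probability of a partial assignment given as {variable: value}
        return sum(p for x, p in joint.items()
                   if all(x[v] == val for v, val in fixed.items()))

    for x_U in product(*(domains[u] for u in U)):
        cond = dict(zip(U, x_U))
        p_U = marg(cond)
        if p_U <= tol:
            continue
        for x_a in domains[a]:
            p_a_given_U = marg({a: x_a, **cond}) / p_U
            for x_b in domains[b]:
                p_bU = marg({b: x_b, **cond})
                if p_bU <= tol:
                    continue  # Equation (1) is only required when p(x_b, x_U) > 0
                p_a_given_bU = marg({a: x_a, b: x_b, **cond}) / p_bU
                if abs(p_a_given_bU - p_a_given_U) > tol:
                    return False
    return True
```

For instance, with three binary variables indexed 0, 1, 2 and `domains = {0: [0, 1], 1: [0, 1], 2: [0, 1]}`, calling `is_cond_independent(joint, 0, 1, [2], domains)` tests $I(X_0, X_1 \mid X_2)$.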
Through the notion of conditional independence it is possible to construct a dependency model $I$, defined formally as follows:

Definition 2 (Dependency model). A dependency model $I$ is a discrete function that returns a truth value, given an input triplet $\langle X_a, X_b \mid X_U \rangle$, for all $X_a, X_b \in X$, $X_U \subseteq X \setminus \{X_a, X_b\}$.
Remark. An alternative viewpoint of the above definition can be obtained by considering that every triplet $\langle X_a, X_b \mid X_U \rangle$ over a domain $X$ is implicitly conditioned on a constant assignment $E = e$ to some variable external to the domain. In this sense, all the triplets of the dependency model become conditioned on the assignment $E = e$.

Any probability distribution is a dependency model, because for any conditional independence assertion it is possible to test its truth value using Equation (1). In this work, we are particularly interested in the set of dependency models that are graph-isomorph, that is, those whose independences and dependences can all be represented in an undirected graph. Formally, an undirected graph $G = (V, E)$ is defined by a set of nodes $V = (a, b, \ldots, n)$ and a set of edges $E \subset V \times V$. Each node $a \in V$ is associated with a random variable $X_a \in X$, and each edge $(a, b) \in E$ represents a direct probabilistic influence between $X_a$ and $X_b$. A necessary and sufficient condition for a dependency model to be graph-isomorph is that all its independence assertions satisfy the following independence axioms, commonly called the Pearl axioms (Pearl and Paz, 1985):
Symmetry:
$$I(X_A, X_B \mid X_U) \Leftrightarrow I(X_B, X_A \mid X_U) \qquad (2)$$
Decomposition:
$$I(X_A, X_B \cup X_W \mid X_U) \Rightarrow I(X_A, X_B \mid X_U) \;\&\; I(X_A, X_W \mid X_U) \qquad (3)$$
Intersection:
$$I(X_A, X_B \mid X_U \cup X_W) \;\&\; I(X_A, X_W \mid X_U \cup X_B) \Rightarrow I(X_A, X_B \cup X_W \mid X_U) \qquad (4)$$
Strong union:
$$I(X_A, X_B \mid X_W) \Rightarrow I(X_A, X_B \mid X_W \cup X_U) \qquad (5)$$
Transitivity:
$$I(X_A, X_B \mid X_W) \Rightarrow I(X_A, X_c \mid X_W) \text{ or } I(X_c, X_B \mid X_W) \qquad (6)$$

Another important property that we will need later to reconstruct graphs from dependency models is the pairwise Markov property, which asserts that an undirected graph can be built from a graph-isomorph dependency model, as follows:
Definition 3 (Pairwise Markov property (Koller and Friedman, 2009)). Let $G$ be a graph over $X$. Two nodes $a$ and $b$ are non-adjacent if and only if the random variables $X_a$ and $X_b$ are conditionally independent given all other variables $X \setminus \{X_a, X_b\}$, i.e.,
$$I(X_a, X_b \mid X \setminus \{X_a, X_b\}) \text{ iff } (a, b) \notin E. \qquad (7)$$

If every independence assertion contained in a dependency model $I$ holds for $p(X)$, $I$ is said to be an I-map of $p(X)$. In a similar fashion, we say that $G$ is also an I-map of $p(X)$. The pairwise property is necessary for those cases in which the graph can only encode a subset of the independences present in the distribution. A distribution can present additional types of independences. In this work we focus on a finer-grained type of independence: context-specific independences (CSIs) (Boutilier et al., 1996; Geiger and Heckerman, 1996; Chickering et al., 1997; Koller and Friedman, 2009). These independences are similar to conditional independences, but hold only for a specific assignment of the conditioning set, called the context of the independence. We define CSIs formally as follows:

Definition 4 (Context-specific independence (Boutilier et al., 1996)). Let $X_a, X_b \in X$ be two random variables, $X_U, X_W \subseteq X \setminus \{X_a, X_b\}$ be pairwise disjoint sets of variables, and $x_W$ some assignment of $X_W$. We say that the variables $X_a$ and $X_b$ are contextually independent given $X_U$ and a context $X_W = x_W$, denoted $I(X_a, X_b \mid X_U, x_W)$, if and only if
$$p(X_a \mid X_b, X_U, x_W) = p(X_a \mid X_U, x_W), \qquad (8)$$
whenever $p(X_b, X_U, x_W) > 0$.

Interestingly, a conditional independence assertion can be seen as a conjunction of CSIs, namely the CSIs for all the contexts of the conditioning set of the conditional independence. Since each CSI is defined for a specific context, they cannot all be represented together in a single undirected graph (Koller and Friedman, 2009). Instead, they can be captured by a dependency model $I$, extended for CSIs by using Equation (8) to test the validity of every assertion $I(X_a, X_b \mid X_U, x_W)$. We call this model a context-specific dependency model $I^c$. If every independence assertion contained in $I^c$ holds for $p(X)$, $I^c$ is said to be a CSI-map of $p(X)$ (Boutilier et al., 1996).
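Concretely, Equation (8) can be checked with the same brute-force approach used earlier for Equation (1), only clamping the context variables $X_W$ to a single assignment instead of enumerating them; as before, the table representation and function name are our own illustration, not part of the paper.

```python
from itertools import product

def is_context_independent(joint, a, b, U, context, domains, tol=1e-9):
    """Brute-force check of I(X_a, X_b | X_U, x_W) as in Definition 4 / Equation (8).

    `context` is a dict {w: value} clamping the context variables X_W;
    the remaining arguments are as in is_cond_independent above.
    """
    def marg(fixed):
        return sum(p for x, p in joint.items()
                   if all(x[v] == val for v, val in fixed.items()))

    for x_U in product(*(domains[u] for u in U)):
        cond = {**context, **dict(zip(U, x_U))}
        p_cond = marg(cond)
        if p_cond <= tol:
            continue
        for x_a in domains[a]:
            p_a_given_cond = marg({a: x_a, **cond}) / p_cond
            for x_b in domains[b]:
                p_b_cond = marg({b: x_b, **cond})
                if p_b_cond <= tol:
                    continue  # Equation (8) is only required when p(x_b, x_U, x_W) > 0
                if abs(marg({a: x_a, b: x_b, **cond}) / p_b_cond - p_a_given_cond) > tol:
                    return False
    return True
```

Requiring this check to pass for every context $x_W \in \mathrm{Val}(X_W)$ recovers the corresponding conditional independence, which is exactly the conjunction of CSIs mentioned above.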
We formally define the context-specific dependency model as follows:

Definition 5 (Context-specific dependency model). A dependency model $I^c$ is a discrete function that returns a truth value given an input triplet $\langle X_a, X_b \mid X_U, x_W \rangle$, for all $X_a, X_b \in X$, $X_U \subseteq X \setminus \{X_a, X_b\}$, and $x_W$ a context over the subset $X_W \subseteq X$.

An MRF uses an undirected graph $G$ and a set of real-valued numerical parameters $\theta$ to represent a distribution. The completely connected sub-graphs of $G$ (a.k.a. cliques) can be used to factorize the distribution into a set of potential functions $\{\phi_C(X_C) : C \in \mathrm{cliques}(G)\}$ of lower dimension than $p(X)$, parameterized by $\theta$. The following theorem shows how to factorize the distribution:

Theorem 1 (Hammersley-Clifford (Hammersley and Clifford, 1971)). Let $p(X)$ be a positive distribution over the domain of variables $X$, and let $G$ be an undirected graph over $X$. If $G$ is an I-map of $p(X)$, then $p(X)$ can be factorized as
$$p(X) = \exp\Big\{ \sum_{C \in \mathrm{cliques}(G)} \phi_C(X_C) - \ln Z \Big\}, \qquad (9)$$
where $Z$ is a normalizing constant.

A distribution factorized by the above theorem is called a Gibbs distribution. The most naive form contains potentials $\phi_C(\cdot)$ represented by tables, where each entry corresponds to an assignment $x_C \in \mathrm{Val}(X_C)$ with an associated numerical parameter. Despite the clear benefit of the factorization described by the Hammersley-Clifford theorem, the representation of a factor as a potential does not allow one to encode CSIs. These patterns are more easily encoded in a more convenient representation called log-linear. The log-linear model represents a Gibbs distribution by using a set of features $\mathcal{F}$ to represent the potentials. A feature is an assignment to a subset of variables of the domain. We denote a feature as $f_C^j$, to make clearer the distinction between the features of a log-linear model and its input assignment $x$. Thus, a potential in a log-linear model is represented as a linear combination of features as follows:
$$\phi_C(X_C = x_{\langle C \rangle}) = \sum_{j=1}^{|\mathrm{Val}(X_C)|} \theta_j \, \delta(x_{\langle C \rangle}, f_C^j),$$
where $\delta(x_{\langle C \rangle}, f_C^j)$ is the Kronecker delta function, that is, it equals $1$ when $x_{\langle C \rangle} = f_C^j$, and $0$ otherwise. By joining the linear combinations of all the potentials and merging their indexes into a unique index $\alpha \in \{1, \ldots, |\mathcal{F}|\}$, we can represent Equation (9) by using the following log-linear model:
$$p(X = x) = \exp\Big\{ \sum_{\alpha} \theta_\alpha \, \delta(x_{\langle C_\alpha \rangle}, f_{C_\alpha}^\alpha) - \ln Z \Big\}. \qquad (10)$$
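As an illustration of Equation (10), the following sketch evaluates a log-linear model by brute force. Each feature is represented as a (scope, assignment, weight) triple; this data structure and the function names are our own choices for the example, and the explicit normalization is exponential in $|X|$, so it is only meant to make the formula concrete.

```python
import math
from itertools import product

# A feature f^j_C is represented here as (scope, assignment, weight):
#   scope      : tuple of variable indexes C
#   assignment : tuple of values of X_C that activates the feature
#   weight     : the parameter theta associated with the feature

def unnormalized(features, x):
    """exp of the weighted sum of active features for the complete assignment x
    (the numerator of Equation (10), before dividing by Z)."""
    s = 0.0
    for scope, assignment, weight in features:
        # the Kronecker delta: the feature fires only if x agrees with it on its scope
        if tuple(x[v] for v in scope) == assignment:
            s += weight
    return math.exp(s)

def log_linear(features, domains):
    """Return p(x) for every complete assignment by normalizing explicitly."""
    variables = sorted(domains)
    assignments = [dict(zip(variables, vals))
                   for vals in product(*(domains[v] for v in variables))]
    Z = sum(unnormalized(features, x) for x in assignments)
    return {tuple(x[v] for v in variables): unnormalized(features, x) / Z
            for x in assignments}
```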
In the next section we present the context-specific Hammersley-Clifford theorem, a generalization of the Hammersley-Clifford theorem that shows how to factorize a distribution (represented by a log-linear model) using a context-specific dependency model $I^c$ that captures the CSIs.

4 Context-specific Hammersley-Clifford theorem

This section presents the main contribution of this work: a generalization of the Hammersley-Clifford theorem for factorizing a distribution represented by a log-linear model, based on a context-specific dependency model $I^c$ that is a CSI-map of $p(X)$. For this, we begin by stating the following corollary of the Hammersley-Clifford theorem:

Corollary 1 (Independence-based Hammersley-Clifford). Let $p(X)$ be a positive distribution, and let $I$ be a graph-isomorph dependency model over $X$. If $I$ is an I-map of $p(X)$, then $p(X)$ can be factorized into a set of potential functions $\{\phi_{C_i}(X_{C_i})\}_i$, such that for any $I(X_a, X_b \mid X_W)$ that is true in $I$, there is no factor $\phi_i(X_{C_i})$ that contains both variables $X_a$ and $X_b$ in $X_{C_i}$.

Proof. From the assumptions, $I$ is graph-isomorph and is an I-map of $p(X)$. By definition, the former implies that there exists an undirected graph $G = (V, E)$ that exactly encodes $I$, and it therefore must also be an I-map of $p(X)$. The assumptions of the Hammersley-Clifford Theorem 1 hold, so $p(X)$ can be factorized into a set of potential functions over the cliques of $G$. Also, since $I$ is graph-isomorph, its conditional independences satisfy the Pearl axioms, in particular the strong union axiom. Therefore, if the conditional independence $I(X_a, X_b \mid X_W)$ is in $I$, the conditional independence $I(X_a, X_b \mid X \setminus \{X_a, X_b\})$ is also in $I$. Using this fact in the pairwise Markov property we can conclude that $(a, b) \notin E$; in other words, $a$ and $b$ cannot belong to the same clique. Since Hammersley-Clifford holds, this last fact implies that no factor $\phi_i(X_{C_i})$ can contain both variables $X_a$ and $X_b$ in $X_{C_i}$.

This corollary shows how to use a dependency model $I$ (instead of a graph) to factorize the distribution $p(X)$. In what follows, we present theoretical results that show how a context-specific dependency model $I^c$ can be used to factorize $p(X)$. The general rationale is to decompose $I^c$ into subsets of CSIs contextualized on certain contexts $x_W$ that are themselves dependency models over sub-domains, and to use those to decompose the conditional distributions of $p(X)$ using Hammersley-Clifford.

Definition 6 (Reduced dependency model). Let $p(X)$ be a distribution over $X$, $x_W$ a context over a subset $X_W \subseteq X$, and $I^c$ a context-specific dependency model over $X$. We define the reduced dependency model $I_{x_W}$ of $I^c$ over the domain $X \setminus X_W$ as the rule that, for each $X_a, X_b \in X$, each pair $X_U, X_W$ of disjoint subsets of $X \setminus \{X_a, X_b\}$, and each assignment $x_W$ of $X_W$, assigns a truth value to a triplet $\langle X_a, X_b \mid X_U, x_W \rangle$ from independence assertions in $I^c$ as follows:
$$I_{x_W}(\langle X_a, X_b \mid X_U, x_W \rangle) = \bigwedge_{x_U \in \mathrm{Val}(X_U)} I^c(\langle X_a, X_b \mid x_U, x_W \rangle). \qquad (11)$$
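Equation (11) can be read as a mechanical recipe: the reduced model answers a query by conjoining CSI queries over all assignments of $X_U$. A minimal sketch, assuming $I^c$ is available as a callable oracle (for instance, the brute-force CSI test sketched earlier); the interface is ours, not the paper's:

```python
from itertools import product

def reduced_query(csi_model, a, b, U, context, domains):
    """Truth value of the triplet <X_a, X_b | X_U, x_W> in the reduced model
    I_{x_W}, following Equation (11): conjoin I^c over all assignments of X_U.

    `csi_model(a, b, x_U, x_W)` is an assumed oracle for I^c returning the truth
    value of the CSI I(X_a, X_b | x_U, x_W), with both arguments given as dicts.
    """
    return all(csi_model(a, b, dict(zip(U, x_U)), context)
               for x_U in product(*(domains[u] for u in U)))
```

For example, the earlier brute-force test can serve as the oracle: `csi_model = lambda a, b, x_U, x_W: is_context_independent(joint, a, b, [], {**x_U, **x_W}, domains)`.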
The following proposition relates the CSI-mapness of a context-specific dependency model to the I-mapness of its reduced dependency models.

Proposition 1. Let $p(X)$ be a distribution over $X$, $x_W$ be a context over a subset of $X$, and $I^c$ be a context-specific dependency model over $X$. If $I^c$ is a CSI-map of $p(X)$, then $I_{x_W}$ is an I-map of $p(X \setminus X_W \mid x_W)$.

Proof. We start by arguing that $I_{x_W}$ is a CSI-map of $p(X)$, and then extend the proof to show that it is an I-map of the conditional $p(X \setminus X_W \mid x_W)$. That $I_{x_W}$ is a CSI-map of $p(X)$ follows from the fact that $I^c$ is a CSI-map of $p(X)$, which implies that not only do its CSIs hold in $p(X)$, but so does any CSI obtained by conjoining those CSIs over all values of any of its variables, in particular the conjunction of Equation (11). That $I_{x_W}$ is an I-map of $p(X \setminus X_W \mid x_W)$ follows from the fact that any CSI $I(X_a, X_b \mid X_U, x_W)$ in $p(X)$ is equivalent to a conditional independence $I(X_a, X_b \mid X_U)$ in the conditional $p(X \setminus X_W \mid x_W)$.

The next auxiliary lemma shows how to factorize a distribution $p(X)$ using a dependency model $I_{x_W}$:
Auxiliary Lemma 1. Let $p(X)$ be a positive distribution over $X$, $I^c$ be a dependency model over $X$, and $I_{x_W}$ be a graph-isomorph dependency model over $X \setminus X_W$. If $I_{x_W}$ is an I-map of the conditional $p(X \setminus X_W \mid x_W)$, then this conditional can be factorized into a set of potential functions $\{\phi_i(X_{C_i})\}_i$ over $X \setminus X_W$, such that for any $I(X_a, X_b \mid X_U, x_W)$ that is true in $I_{x_W}$, there is no factor $\phi_i(X_{C_i})$ that contains both $a$ and $b$ in $C_i$.

Proof. The proof consists of using Corollary 1 with the conditional $p(X \setminus X_W \mid x_W)$ as the distribution and $I_{x_W}$ as the dependency model. For that, we show that they satisfy the requirements of the corollary, that is, $p(X \setminus X_W \mid x_W)$ is positive, and $I_{x_W}$ is a graph-isomorph dependency model over the domain $X \setminus X_W$ that is an I-map of the conditional. That $I_{x_W}$ is an I-map of the conditional and graph-isomorph follows from the assumptions. It remains then to prove the positivity of the conditional. For that, the conditional is expanded as follows:
$$p(X \setminus X_W \mid x_W) = \frac{p(X \setminus X_W, x_W)}{p(x_W)} = \frac{p(X \setminus X_W, x_W)}{\sum_{x_{X \setminus X_W} \in \mathrm{Val}(X \setminus X_W)} p(x_{X \setminus X_W}, x_W)},$$
where the sum expansion of the denominator follows from the law of total probability. The conditional has thus been expressed as an operation over joint probabilities, all of them positive, so both the numerator and the denominator, and therefore the whole quotient, are positive.

With Auxiliary Lemma 1, we can present our main theoretical result, a theorem that generalizes Theorem 1 to factorize the features $\mathcal{F}$ in a log-linear model of $p(X)$ according to some given context-specific dependency model $I^c$. For this, we need to define precisely what we mean by a factorization of a set of features $\mathcal{F}$. We do this in two steps: one that defines a factorization according to reduced dependency models, and then the contextualized case for context-specific dependency models.

Definition 7 (Feature factorization). Let $\mathcal{F}$ be a set of features over some domain $X$, and $I_{x_W}$ some reduced dependency model over $X \setminus X_W$. We say that the features $\mathcal{F}$ factorize according to $I_{x_W}$ if for each $I(X_a, X_b \mid X_U, x_W)$ that is true in $I_{x_W}$, and each feature $f_C \in \mathcal{F}$ such that $f_{C\langle W \rangle} = x_W$, it holds that either $a \notin C$ or $b \notin C$.

Definition 8 (Context-specific feature factorization). Let $\mathcal{F}$ be a set of features over some domain $X$, and $I^c$ be a context-specific dependency model. The features $\mathcal{F}$ are said to factorize according to $I^c$ if they factorize according to each reduced dependency model $I_{x_W}$ of $I^c$ (as defined by Definition 7), with $X_W \subseteq X$ and $x_W \in \mathrm{Val}(X_W)$.
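Definition 7 translates directly into a syntactic check on the feature set. The sketch below uses the same (scope, assignment, weight) feature representation as before, takes the independences that are true in $I_{x_W}$ as an explicit list of pairs, and treats a feature as matching the context when its assignment agrees with $x_W$ on the variables it shares with $X_W$; this last point is one possible reading of $f_{C\langle W\rangle} = x_W$, stated here as an assumption.

```python
def factorizes(features, independences, context):
    """Check Definition 7: no feature that matches the context x_W may contain
    both X_a and X_b for an independence I(X_a, X_b | X_U, x_W) true in I_{x_W}.

    features      : list of (scope, assignment, weight) triples
    independences : list of pairs (a, b) true in the reduced model I_{x_W}
    context       : dict {w: value} giving the context x_W
    """
    for scope, assignment, _ in features:
        feat = dict(zip(scope, assignment))
        # skip features whose values disagree with the context on X_W
        if any(w in feat and feat[w] != v for w, v in context.items()):
            continue
        for a, b in independences:
            if a in scope and b in scope:
                return False  # the feature couples X_a and X_b despite the CSI
    return True
```

Definition 8 then amounts to repeating this check for every reduced dependency model of $I^c$, i.e., for every subset $X_W$ and every context $x_W \in \mathrm{Val}(X_W)$.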
We now present our main theorem, and then discuss practical issues regarding its requirements.

Theorem 2 (Context-specific Hammersley-Clifford). Let $p(X)$ be a positive distribution over $X$, $\mathcal{F}$ be a set of features from a log-linear model of $p(X)$, and $I^c$ be a context-specific dependency model over $X$ such that each of its reduced dependency models (over all possible contexts) is graph-isomorph. If $I^c$ is a CSI-map of $p(X)$, then $\mathcal{F}$ factorizes according to $I^c$.

Proof. From the definition of context-specific feature factorization, the conclusion of the theorem holds if $\mathcal{F}$ factorizes according to each reduced dependency model of $I^c$. So let $I_{x_W}$ be some arbitrary reduced dependency model for context $x_W$; we prove that $\mathcal{F}$ factorizes according to $I_{x_W}$, which by Definition 7 requires that (a) for each $I(X_a, X_b \mid X_U, x_W)$ that is true in $I_{x_W}$, and (b) for each $f_C \in \mathcal{F}$ such that $f_{C\langle W \rangle} = x_W$, it holds that (c) either $a \notin C$ or $b \notin C$.

To proceed, we first apply Auxiliary Lemma 1 to $p(X)$, the context $x_W$, and the reduced dependency model $I_{x_W}$. Its requirements are satisfied, that is, $p(X)$ is positive and $I_{x_W}$ is both graph-isomorph and an I-map of the conditional $p(X \setminus X_W \mid x_W)$ (by Proposition 1). From this we conclude the consequent of the lemma, i.e., that the conditional $p(X \setminus X_W \mid x_W)$ can be factorized into a set of potential functions $\{\phi_i(X_{C_i})\}_i$ such that (i) for each $I(X_a, X_b \mid X_U, x_W)$ that is true in $I_{x_W}$, and (ii) for each factor $\phi_i(X_{C_i}) \in \{\phi_i(X_{C_i})\}_i$, it holds that (iii) either $a \notin C_i$ or $b \notin C_i$.

To conclude, we argue that conclusions (i), (ii), and (iii) of the auxiliary lemma are equivalent to the requirements (a), (b), and (c) of the factorization. Clearly, conclusions (i) and (iii) match requirements (a) and (c). We now show the equivalence of (ii) with (b). A factor $\phi_i(X_{C_i})$ of the conditional $p(X \setminus X_W \mid x_W)$ is equivalent to a factor $\phi_i(X_{C_i}, x_W)$ over the joint $p(X)$, which is composed of features $f_{C_i \cup W}$ whose values over $X_W$ match $x_W$, i.e., $f_{C_i \cup W \langle W \rangle} = x_W$.

The theorem requires that each possible reduced dependency model of $I^c$ be graph-isomorph. What is the implication of this requirement? By the definition of graph-isomorphism, it implies that for each possible context $x_W$, the reduced dependency model $I_{x_W}$ can be encoded as an undirected graph over the sub-domain $X \setminus X_W$. This provides us a means to construct $I^c$ graphically, i.e., by constructing an undirected graph for each possible sub-domain and assignment of its complement. In practice, this may be done by experts who provide a list of CSIs that hold in the domain, or by running a structure learning algorithm over each context. This may sound overly complex, as there is clearly an exponential number of such contexts. No doubt future works can explore this aspect, finding alternatives for simplifying this complexity in different special cases.
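The per-context construction just discussed could, in principle, be organized as in the following sketch. It is not an algorithm from the paper: `learn_graph` is a hypothetical placeholder for any off-the-shelf Markov network structure learner run on the data compatible with a context, and the enumeration is restricted to small contexts precisely because of the exponential blow-up noted above.

```python
from itertools import combinations, product

def elicit_csis(data, domains, learn_graph, max_context_size=1):
    """Enumerate small contexts; for each, learn a graph over the remaining
    variables and read CSIs off its non-edges via the pairwise Markov property."""
    csis = []  # list of (a, b, context) with context a dict {w: value}
    variables = sorted(domains)
    for k in range(1, max_context_size + 1):
        for W in combinations(variables, k):
            rest = [v for v in variables if v not in W]
            for x_W in product(*(domains[w] for w in W)):
                context = dict(zip(W, x_W))
                # keep only the samples compatible with the context x_W
                subset = [row for row in data
                          if all(row[w] == val for w, val in context.items())]
                edges = learn_graph(subset, rest)  # hypothetical: returns a set of edges
                for a, b in combinations(rest, 2):
                    if (a, b) not in edges and (b, a) not in edges:
                        # non-adjacency in the context graph suggests a CSI for this context
                        csis.append((a, b, context))
    return csis
```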
5 Conclusions

We have presented a theoretical method for factorizing a Markov random field according to the CSIs present in a distribution, which is formally guaranteed to be correct. This is presented in the context-specific Hammersley-Clifford theorem, a generalization to CSIs of the Hammersley-Clifford theorem, which applies to conditional independences. In accordance with our theoretical result, we believe it is worth guiding our future work toward implementing algorithms that learn from data the structure of MRFs for each possible context, and then factorize the distribution using the learned structures. Intuitively, it seems likely that this can achieve improvements in time, space, and sample complexities in comparison with other approaches that only use the conditional independences encoded by the graph.
References
Boutilier, C., Friedman, N., Goldszmidt, M., and Koller, D. (1996). Context-specific independence in Bayesian networks. In Proceedings of the Twelfth International Conference on Uncertainty in Artificial Intelligence, pages 115–123. Morgan Kaufmann Publishers Inc.

Chickering, D. M., Heckerman, D., and Meek, C. (1997). A Bayesian approach to learning Bayesian networks with local structure. In Uncertainty in Artificial Intelligence, pages 80–89. Morgan Kaufmann Publishers Inc.

Della Pietra, S., Della Pietra, V. J., and Lafferty, J. D. (1997). Inducing features of random fields. IEEE Trans. PAMI, 19(4):380–393.

Fierens, D. (2010). Context-specific independence in directed relational probabilistic models and its influence on the efficiency of Gibbs sampling. In European Conference on Artificial Intelligence, pages 243–248.

Geiger, D. and Heckerman, D. (1996). Knowledge representation and inference in similarity networks and Bayesian multinets. Artificial Intelligence, 82:45–74.

Gogate, V., Webb, W. A., and Domingos, P. (2010). Learning efficient Markov networks. In Neural Information Processing Systems.

Hammersley, J. M. and Clifford, P. (1971). Markov fields on finite graphs and lattices.

Heckerman, D., Geiger, D., and Chickering, D. M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning.

Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge.

Lam, W. and Bacchus, F. (1994). Learning Bayesian belief networks: an approach based on the MDL principle. Computational Intelligence, 10:269–293.

Lee, S., Ganapathi, V., and Koller, D. (2006). Efficient structure learning of Markov networks using L1-regularization. In Neural Information Processing Systems.

Lowd, D. and Davis, J. (2010). Learning Markov network structure with decision trees. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 334–343. IEEE.

McCallum, A. (2003). Efficiently inducing features of conditional random fields. In Proceedings of Uncertainty in Artificial Intelligence (UAI).

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Inc., 1st edition.

Pearl, J. and Paz, A. (1985). GRAPHOIDS: A graph-based logic for reasoning about relevance relations. Technical Report 850038 (R-53-L), Cognitive Systems Laboratory, University of California, Los Angeles.

Poole, D. and Zhang, N. L. (2003). Exploiting contextual independence in probabilistic inference. J. Artif. Intell. Res. (JAIR), 18:263–313.

Ravikumar, P., Wainwright, M. J., and Lafferty, J. D. (2010). High-dimensional Ising model selection using L1-regularized logistic regression. Annals of Statistics, 38:1287–1319.

Spirtes, P., Glymour, C., and Scheines, R. (2000). Causation, Prediction, and Search. Adaptive Computation and Machine Learning Series. MIT Press.

Wexler, Y. and Meek, C. (2008). Inference for multiplicative models. In Uncertainty in Artificial Intelligence.