Is profile likelihood a true likelihood? An argument in favor
Oliver J. Maclaren
Department of Engineering Science, University of Auckland, Auckland, New Zealand
Abstract.
Profile likelihood is the key tool for dealing with nuisance parameters in likelihood theory. It is often asserted, however, that profile likelihood is not a ‘true’ likelihood. One implication is that likelihood theory lacks the generality of e.g. Bayesian inference, wherein marginalization is the universal tool for dealing with nuisance parameters. Here we argue that profile likelihood has as much claim to being a true likelihood as a marginal probability has to being a true probability distribution. The crucial point, we argue, is that a likelihood function is naturally interpreted as a maxitive possibility measure: given this, the associated theory of integration with respect to maxitive measures delivers profile likelihood as the direct analogue of marginal probability in additive measure theory. Thus, given a background likelihood function, we argue that profiling over the likelihood function is as natural (or as unnatural, as the case may be) as marginalizing over a background probability measure. The connections to Bayesian inference can also be further clarified with the introduction of a suitable logarithmic distance function, in which case the present theory can be naturally described as ‘Tropical Bayes’ in the sense of tropical algebra.
Key words and phrases:
Estimation, Inference, Profile Likelihood, Marginalization, Nuisance Parameters, Idempotent Integration, Maxitive Measure Theory, Tropical Algebra, Tropical Bayes.
1. INTRODUCTION
Consider the opening sentence from the entry on profile likelihood in the Encyclopedia of Biostatistics (Aitkin, 2005):
The profile likelihood is not a likelihood, but a likelihood maximized over nuisance parameters given the values of the parameters of interest.
Numerous similar assertions that profile likelihood is not a ‘true’ likelihood may be found throughout the literature and various textbooks, and this is apparently the accepted viewpoint of the statistical community. Importantly, this includes the ‘pure’ likelihood literature, which generally accepts a lack of systematic methods for dealing with nuisance parameters, while still recommending profile likelihood as the most general, albeit ‘ad-hoc’, solution (see e.g. Royall, 1997; Rohde, 2014; Edwards, 1992; Pawitan, 2001). Similarly, recent monographs on characterizing statistical evidence present favorable opinions of the likelihood approach but criticize the lack of general methods for dealing with nuisance parameters (Aitkin, 2010; Evans, 2015). The various justifications given, however, appear to the present author to be rather vague and unconvincing. For example, suppose we modified the above quotation to refer to marginal probability instead of profile likelihood:
A marginal probability is not a probability, but a probability distribution integrated over nuisance variables given the values of the variables of interest.
The above would be a perfectly fine characterization of a marginal probability if the “not a probability, but” part was dropped, i.e.
A marginal probability is a probability distribution integrated over nuisance variables given the values of the variables of interest.
Simply put: the fact that a marginal probability is obtained by integrating over a ‘background’ probability distribution does not prevent the marginal probability from being a true probability. The crucial observation in the case of marginal probability is that integration over variables takes probability distributions to probability distributions.

The purpose of the present article is to point out that there is an appropriate notion of integration over variables that takes likelihood functions to likelihood functions via maximization. This notion of integration is based on the idea of idempotent analysis, wherein one replaces a standard algebraic operation such as addition in a given mathematical theory with another basic algebraic operation, defining a form of ‘idempotent addition’, to obtain a new analogous, self-consistent theory (Maslov, 1992; Kolokoltsov and Maslov, 1997). In this case one simply replaces the usual ‘addition’ operations, including the usual (Lebesgue) integration, with ‘maximization’ operations, including taking suprema, to obtain a new ‘idempotent probability theory’. Maximization in this context is understood algebraically as an idempotent addition operation, hence the terminology. While perhaps somewhat exotic at first sight, this idea finds direct applications in e.g. large deviation theory (Puhalskii, 2001) and, most relevantly, possibility theory, fuzzy set theory and pure-likelihood-based decision theory (Dubois, Moral and Prade, 1997; Cattaneo, 2013, 2017). A popular special instance of idempotent mathematics is so-called ‘tropical mathematics’, in which multiplication is also converted to a new algebraic operation, here addition (see e.g. Speyer and Sturmfels, 2009; Akian, Quadrat and Viot, 1996; Litvinov, 2007; Pachter and Sturmfels, 2004; Bernhard, 2000). That is, the basic ‘addition’ and ‘multiplication’ operations in tropical algebra are interpreted as (max, +), respectively, instead of the usual (+, ×).
With the introduction of a logarithmic distance in likelihood theory, multiplication of likelihoods becomes addition of log-likelihoods and we are naturally led to a ‘Tropical Bayesian’ interpretation of (log) profile likelihoods. This provides a formal foundation for the usual intuitive interpretation of (negative) log-likelihoods as ‘cost’ measures.

The present argument is not, of course, without objections. In particular, acceptance or rejection of the present interpretation depends on what one believes the key properties of likelihood should be; this is, perhaps surprisingly, not without significant controversy (Bayarri and DeGroot, 1992; Bjørnstad, 1996; Bayarri, DeGroot and Kadane, 1988). Thus we end with a discussion of various potential objections, including a discussion of some properties one might want a general notion of ‘likelihood’ to satisfy and whether the present interpretation does or does not satisfy these. Despite potential conflicts with some frequentist, evidential and/or Bayesian considerations, we believe that the present interpretation is a clear, self-consistent and suitable foundational concept for ‘pure’ likelihood theory (particularly that developed by Edwards, 1992), and/or for what we propose to call ‘Tropical Bayes’.
2. LIKELIHOOD AS A POSSIBILITY MEASURE
Though apparently not well known in the statistical literature, likelihood theory is known in the wider literature on uncertainty quantification to have a natural correspondence to possibility theory rather than to probability theory (Dubois, Moral and Prade, 1997; Cattaneo, 2013, 2017). This has perhaps been obscured by the usefulness of likelihood methods as tools in probabilistic statistical inference. It is not our intention to review this wider literature in detail here (see e.g. Dubois, Moral and Prade, 1997; Cattaneo, 2013, 2017; Augustin et al., 2014; Halpern, 2017, for more), but simply to point out the implications of this correspondence. In particular, likelihood theory interpreted as a possibilistic, rather than probabilistic, theory can be summarized as:
Probability theory with addition replaced by maximization.
As indicated above, this is sometimes known as, for example, ‘idempotent measure theory’, ‘maxitive measure theory’ or ‘possibility theory’, among other names (see e.g. Dubois, Moral and Prade, 1997; Cattaneo, 2013, 2017; Augustin et al., 2014; Halpern, 2017; Maslov, 1992; Kolokoltsov and Maslov, 1997; Puhalskii, 2001, for more). This correspondence perhaps explains the preponderance of maximization methods in likelihood theory, including the methods of maximum likelihood and profile likelihood.

The most important consequence of this perspective is that the usual Lebesgue integration with respect to an additive measure, as in probability theory, becomes, in likelihood/possibility theory, a different type of integration, defined with respect to a maxitive measure. Again, the key point is simply that addition operations (including summation and integration) are replaced by maximization operations (or taking suprema in general).

For completeness, we contrast the key axioms of possibility theory with those of probability theory. Given a set of possibilities Ω, assumed to be discrete for the moment for simplicity, and two discrete sets of possibilities A, B ⊆ Ω, the key axioms of elementary possibility theory are (Halpern, 2017):

(2.1)
    poss(∅) = 0
    poss(Ω) = 1
    poss(A ∪ B) = max{poss(A), poss(B)}

which can be contrasted with those of elementary probability theory:

(2.2)
    prob(∅) = 0
    prob(Ω) = 1
    prob(A ∪ B) = prob(A) + prob(B)

where A and B are required to be disjoint in the probabilistic case, but this is not strictly required in the possibilistic case.

Given a ‘background’ or ‘starting’ likelihood measure, likelihood theory can be developed as a self-contained theory of possibility, where derived distributions are manipulated according to the first set of axioms above. This is entirely analogous to developing probability theory from a background measure, with derived distributions manipulated according to the second set of axioms. As our intention is to consider methods for obtaining derived distributions by ‘eliminating’ nuisance parameters, we need not consider here where the starting measure comes from (but see the Discussion).

To make the correspondences of interest clear in what follows, we first present probabilistic marginalization as a special case of a pushforward measure or, equivalently, as a special case of a general (not necessarily 1-1) change of variables. We then consider the possibilistic analogues.
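The contrast between the axioms (2.1) and (2.2) can be checked directly on a small discrete space. The following sketch (Python; the point values are illustrative only, chosen to be exactly representable in binary floating point) builds a possibility measure and a probability measure from point values and verifies the respective union rules:

```python
# Contrast of axioms (2.1) and (2.2) on a small discrete space.
# Point values are illustrative only.
omega = {"a", "b", "c"}
point_poss = {"a": 1.0, "b": 0.5, "c": 0.25}  # possibility of each singleton

def poss(event):
    """Maxitive measure: poss(A) = max of point possibilities (0 for the empty set)."""
    return max((point_poss[w] for w in event), default=0.0)

point_prob = {"a": 0.5, "b": 0.25, "c": 0.25}  # probability of each singleton

def prob(event):
    """Additive measure: prob(A) = sum of point probabilities."""
    return sum(point_prob[w] for w in event)

A, B = {"a"}, {"b", "c"}  # disjoint events
# Maxitive union rule: poss(A ∪ B) = max{poss(A), poss(B)}
assert poss(A | B) == max(poss(A), poss(B))
# Additive union rule (disjoint A, B): prob(A ∪ B) = prob(A) + prob(B)
assert prob(A | B) == prob(A) + prob(B)
# Normalization: poss(∅) = 0, poss(Ω) = 1, prob(Ω) = 1
assert poss(set()) == 0.0 and poss(omega) == 1.0 and prob(omega) == 1.0
```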
3. PUSHFORWARD PROBABILITY MEASURES AND THE DELTA FUNCTION METHOD FOR GENERAL CHANGES OF VARIABLE
Given a probability measure µ over a random variable x ∈ R^n with associated density ρ, define the new random variable t = T(x), where T : R^n → R^m. This variable is distributed according to the pushforward measure T⋆µ, i.e. t ∼ T⋆µ. The density of t, here denoted by q = T⋆ρ, is conveniently calculated via the delta function method, which is valid for arbitrary changes of variables (not necessarily 1-1):

(3.1)  q(t) = [T⋆ρ](t) = ∫ δ(t − T(x)) ρ(x) dx.

As a side point, we note that this method of carrying out arbitrary transformations of variables is standard in statistical physics (see e.g. Van Kampen, 1992), but is apparently less common in statistics (see the articles Au and Tam, 1999; Khuri, 2004, aimed at highlighting this method to the statistical community).
The above means that we can interpret marginalization to a component x_1, say, as a special case of a (non-1-1) deterministic change of variables via:

(3.2)  ρ(x_1) = ∫ δ(x_1 − proj_1(x)) ρ(x) dx,

where proj_1(x) is simply the projection of x to its first coordinate. Thus marginalization can be thought of as the pushforward under the projection operator and as a special case of a general (not necessarily 1-1) change of variables t = T(x).
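On a discrete space the delta-function integral of Eq. (3.2) reduces to summing the density over each fiber {x | T(x) = t}. A minimal sketch (Python; the joint distribution below is a made-up illustration):

```python
# Marginalization as a pushforward under projection: on a discrete space the
# delta-function integral of Eq. (3.2) becomes a sum over each fiber
# {x | T(x) = t}. The joint distribution below is illustrative only.
joint = {  # p(x1, x2) on a 2x3 grid
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.25, (1, 1): 0.15, (1, 2): 0.20,
}

def pushforward(density, T):
    """Pushforward of a discrete density under an arbitrary (not necessarily
    1-1) map T: sum the density over each fiber {x | T(x) = t}."""
    out = {}
    for x, p in density.items():
        t = T(x)
        out[t] = out.get(t, 0.0) + p
    return out

proj1 = lambda x: x[0]  # projection to the first coordinate
marginal = pushforward(joint, proj1)
# Sums over fibers: p(x1=0) = 0.10 + 0.20 + 0.10, p(x1=1) = 0.25 + 0.15 + 0.20
assert abs(marginal[0] - 0.40) < 1e-12
assert abs(marginal[1] - 0.60) < 1e-12
```

The same `pushforward` works unchanged for any non-1-1 map T, not only projections, mirroring the generality of Eq. (3.1).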
4. PROFILE LIKELIHOOD AS MARGINAL POSSIBILITY AND AN EXTENSION TO GENERAL CHANGES OF VARIABLE
As we have repeatedly stressed above, likelihood theory interpreted as a possibilistic, and hence maxitive, measure theory simply means that addition operations such as the usual Lebesgue integration are replaced by maximization operations such as taking the supremum.

Consider first, then, the analogue of a marginal probability density, which we will call a marginal possibility distribution and denote by L_p. Starting from a ‘background’ likelihood measure L(x) we ‘marginalize’ in the analogous manner to before:

(4.1)  L_p(x_1) = sup_x {δ(x_1 − proj_1(x)) L(x)} = sup_{x | proj_1(x) = x_1} {L(x)}.

This is again simply the pushforward under the projection operator, but here under a different type of ‘integration’, i.e. the operation of taking a supremum. Of course, this is just the usual profile likelihood for x_1.

As above, we need not be restricted to marginal possibility distributions: we can consider arbitrary functions of the parameter, t = T(x). This leads to an analogous pushforward operation taking L(x) to L_p(t), which we denote by ⋆_p:

(4.2)  L_p(t) = [T ⋆_p L](t) = sup_x {δ(t − T(x)) L(x)} = sup_{x | T(x) = t} {L(x)},

which again corresponds to the usual definition of profile likelihood.
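The maxitive pushforward of Eq. (4.2) is the additive computation with the sum over each fiber replaced by a maximum. A sketch on a discrete parameter grid (Python; the background likelihood values and the interest map are made up for illustration):

```python
# Profile likelihood as a maxitive pushforward, Eq. (4.2):
# L_p(t) = sup over the fiber {x | T(x) = t} of L(x).
# The background likelihood below is illustrative only.
likelihood = {  # L(theta1, theta2) on a small grid
    (0, 0): 0.2, (0, 1): 0.9, (0, 2): 0.4,
    (1, 0): 1.0, (1, 1): 0.3, (1, 2): 0.5,
}

def profile(L, T):
    """Possibilistic pushforward: max (not sum) over each fiber {x | T(x) = t}."""
    out = {}
    for x, val in L.items():
        t = T(x)
        out[t] = max(out.get(t, 0.0), val)
    return out

# Profile out theta2 (i.e. project onto theta1):
Lp = profile(likelihood, lambda x: x[0])
assert Lp == {0: 0.9, 1: 1.0}
# The supremum is preserved: sup_t L_p(t) = sup_x L(x)
assert max(Lp.values()) == max(likelihood.values())
```

Note that the only change from the additive `pushforward` of the previous section is `max` in place of `+`, which is exactly the idempotent-integration substitution described in the text.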
5. A SIMPLE EXAMPLE COMPARING MARGINAL PROBABILITY ANDMARGINAL POSSIBILITY
Here we consider a simple example illustrating the difference between probabilistic and possibilistic reasoning, in particular under marginalization/non-1-1 changes of variable.

Suppose you have three suspects in a crime. Through some means or another you decide on the following ‘plausibility’ distribution, where plausibility is used here as a general umbrella term for probabilistic and/or possibilistic reasoning: suspect one has plausibility 0.4, while the other two suspects each have plausibility 0.3. You also know that suspect one was wearing a red hat at the time of the crime while the other two were wearing blue hats.

According to the above, under a probabilistic interpretation, the most probable perpetrator is suspect one (who wore a red hat); but the most probable hat color of the perpetrator is blue (with probability 0.3 + 0.3 = 0.6). This is a consequence of the additivity of probability theory and the non-1-1 change of variables in going from suspects to hat colors.

On the other hand, if you interpret the given plausibility numbers as a possibility distribution, then according to standard possibility theory the most possible suspect is suspect one and the most possible hat color is now red, i.e. the hat color of the most possible suspect. Similarly, this is a consequence of the maxitivity of possibility theory.

The difference can be made more extreme given a large number of ‘other’ suspects, each with low plausibility but sharing some common property that the main suspect lacks. Again, these results are a simple consequence of how additivity and maxitivity, respectively, interact with non-1-1 changes of variable (here: person to hat color).

We believe that there are reasonable situations where additivity is desirable, but also reasonable situations in which maxitivity might be preferred. This is a subject worth further debate. We note, however, that a relative probability approach to the problem of statistical evidence, such as that presented in Evans (2015), comes to conclusions similar to those of a possibility approach (Michael Evans, personal communication).
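The suspect example can be computed directly; a self-contained sketch (Python, using the plausibility numbers from the text) makes the divergence between the two marginals explicit:

```python
# The three-suspect example: the same point plausibilities, pushed forward
# additively vs maxitively under the non-1-1 map suspect -> hat color.
plaus = {"s1": 0.4, "s2": 0.3, "s3": 0.3}
hat = {"s1": "red", "s2": "blue", "s3": "blue"}

# Probabilistic (additive) marginal over hat color:
prob_hat = {}
for s, p in plaus.items():
    prob_hat[hat[s]] = prob_hat.get(hat[s], 0.0) + p

# Possibilistic (maxitive) marginal over hat color:
poss_hat = {}
for s, p in plaus.items():
    poss_hat[hat[s]] = max(poss_hat.get(hat[s], 0.0), p)

# Most plausible suspect is s1 (red hat), yet the most probable hat color is blue:
assert max(plaus, key=plaus.get) == "s1"
assert max(prob_hat, key=prob_hat.get) == "blue"  # 0.3 + 0.3 = 0.6 > 0.4
# Under maxitivity the most possible hat color agrees with the most possible suspect:
assert max(poss_hat, key=poss_hat.get) == "red"   # max(0.3, 0.3) = 0.3 < 0.4
```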
6. STRENGTH OF EVIDENCE, DISTANCES AND ‘TROPICAL BAYES’
As noted in Evans (2015), it is perhaps less controversial to hold that likelihood gives a qualitatively reasonable relative ordering of preference for parameter values in light of data than it is to hold (e.g. Royall, 1997) that it provides a quantitative measure of relative support.

To make some progress towards addressing this distinction, we consider how to define a suitable notion of distance that respects, but is distinct from, a given qualitative ordering. Notions of statistical distance are common in the statistical literature (see e.g. Basu, Shioya and Park, 2011, and references therein); here, however, we follow the ideas developed by Tarantola (2006) of quality spaces and distances defined in these. This leads naturally to the idea of pure likelihood theory as a form of what we propose to call ‘Tropical Bayes’, where the meaning of this term is discussed below.

In particular, given the ordering induced by a likelihood function (and/or profile likelihood function):

(6.1)  θ_1 is preferred to θ_2 iff L(θ_1) > L(θ_2),

we can define a likelihood distance via

(6.2)  D_L(θ_1, θ_2) = |log(L(θ_1)/L(θ_2))| = |log L(θ_1) − log L(θ_2)|.

This distance has the properties of being symmetric, additive and zero iff L(θ_1) = L(θ_2). Tarantola (2006) argues that this notion of distance is widely applicable for many types of qualitative orderings. In the present case it is, of course, just the well-known log-likelihood ratio function. We propose then that, accepting that the likelihood gives a natural qualitative preference or plausibility ordering, the log-likelihood then gives a natural distance in this ‘qualitative space’. There remains, however, a choice of logarithm base and/or a choice of arbitrary distance scale factor; thus we cannot fully remove some of the ‘qualitative’ features associated with pure likelihood theory without a further choice of reference. One natural choice might be to take the minimum distance to a fully saturated model, i.e. one which can fit the data perfectly, in which case one would be interested in how much ‘fit’ to trade off against parsimony considerations (Edwards, 1992).

Interestingly, the combination of replacing addition operations by maximization and then working in log-space (wherein multiplication becomes addition) corresponds to completing the ‘tropicalization’ of probability theory: moving from an algebraic structure in terms of (+, ×) to one in terms of (max, +). This is the subject of ‘tropical algebra’, which also goes by the name ‘max-plus’ algebra, and is a popular special instance of idempotent mathematics with applications to decision theory, uncertainty quantification, statistical inference and optimization (see e.g. Speyer and Sturmfels, 2009; Akian, Quadrat and Viot, 1996; Litvinov, 2007; Pachter and Sturmfels, 2004; Bernhard, 2000, for some relevant starting points in this area). A natural interpretation of negative log-likelihood functions in this context is as ‘cost measures’; these have also been termed ‘Maslov measures’, due to their origins in Maslov’s idempotent probability theory (Akian, Quadrat and Viot, 1996; Bernhard, 2000). These analogies are explored in detail by Akian, Quadrat and Viot (1996), where the natural analogue of a random variable is a decision variable, the analogue of a Markov chain is a Bellman chain (i.e. the Bellman equation from the subject of dynamic programming), and so on.

Finally, however, we note that even if profile likelihood is accepted as the natural analogue of marginal probability, the evidential interpretation of profile likelihood may still have difficulties; this is discussed further below.
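The stated properties of the distance in Eq. (6.2), symmetry, additivity along the likelihood ordering, and vanishing iff the likelihoods agree, can be checked numerically. A small sketch (Python; the likelihood values are illustrative only):

```python
import math

# Likelihood distance of Eq. (6.2): D_L(a, b) = |log L(a) - log L(b)|.
L = {"th1": 0.8, "th2": 0.2, "th3": 0.05}  # illustrative likelihood values

def D(a, b):
    """Log-likelihood-ratio distance between parameter values a and b."""
    return abs(math.log(L[a]) - math.log(L[b]))

# Symmetry and identity of indiscernibles:
assert D("th1", "th2") == D("th2", "th1")
assert D("th1", "th1") == 0.0
# Additivity along the likelihood ordering th1 > th2 > th3:
assert abs(D("th1", "th3") - (D("th1", "th2") + D("th2", "th3"))) < 1e-12
```

Changing the logarithm base rescales every distance by the same constant, which is the arbitrary scale factor mentioned in the text.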
7. DISCUSSION

7.1 Objections to profile likelihood
As discussed, it is frequently asserted that profile likelihood is not a true likelihood (Aitkin, 2005; Royall, 1997; Pawitan, 2001; Rohde, 2014; Evans, 2015). Common reasons include: that it is obtained from a likelihood via maximization (Aitkin, 2005), that it is not based directly on observable quantities (Royall, 1997; Pawitan, 2001; Rohde, 2014), and that it lacks particular repeated sampling properties (Royall, 1997; Cox and Barndorff-Nielsen, 1994).

None of the above objections appear to the present author to apply to the following: given a starting or ‘background’ likelihood function, profile likelihood satisfies the axioms of possibility theory, in which the basic additivity axiom of probability theory is replaced by a maxitivity axiom. Profile likelihood is simply the natural possibilistic counterpart to marginal probability, where additive integration is replaced by a maxitive analogue. We thus argue that, if marginal probability is a ‘true’ probability, then profile likelihood should likewise be considered a ‘true’ likelihood, at least when likelihood theory is interpreted in a possibilistic manner. Negative log-likelihood functions can then be naturally interpreted as cost measures in the sense of tropical mathematics.
Regarding the latter two objections mentioned above, concerning observable quantities and repeated sampling properties: it is important to note that the given data must be held fixed to give a consistent background likelihood over which to profile. Given fixed data one has a fixed possibility measure and thus can consider ‘marginal’, i.e. profile, likelihoods. In contrast, repeated sampling will produce a distribution of such possibility measures, and these may or may not have good frequentist properties. None of this is in contrast to marginal probability: changing the distribution over which we marginalize changes the resulting marginal probability. Of course, despite this caveat, profile likelihood often does have good repeated sampling properties (Royall, 1997; Cox and Barndorff-Nielsen, 1994) and also plays a key role in frequentist theory, though we do not discuss this further here. One consequence is that our conception of profile likelihood does not generally satisfy properties such as zero expectation of the associated score function (Cox and Barndorff-Nielsen, 1994; Pawitan, 2001). These are, however, properties dependent on particular repeated sampling notions such as ‘unbiasedness’, and hence are more properly considered frequentist concepts. The present approach is more suitable for those seeking a non-probabilistic ‘plausibility’ measure, as induced by data that are considered fixed once observed.
A natural question, perhaps, is: why worry about whether profile likelihood is a true likelihood? One answer is that profile likelihood is a widely used tool but is often dismissed as ‘ad-hoc’ or lacking proper justification. This gives the impression that, for example, likelihood theory is lacking in comparison with e.g. Bayesian theory in terms of systematic methods for dealing with nuisance parameters. By understanding that profile likelihood does in fact have a systematic basis in terms of possibility theory, practitioners and students can better understand and reason about a widely popular and useful tool. Understanding the connection to possibilistic as opposed to probabilistic reasoning may also help explain why profile likelihood has emerged as a particularly promising method of identifiability analysis (Raue et al., 2009), where identifiability is traditionally a prerequisite for probabilistic analysis. Of course, as indicated, the price of accepting profile likelihood as a ‘true’ likelihood is an interpretation in terms of pure likelihood theory, and this makes the connections to repeated sampling properties more complicated. We see no need, however, to restrict oneself to one perspective on statistical inference: the present possibilistic view can complement other approaches such as frequentist or Bayesian statistics. Furthermore, this analogy opens strong connections between likelihood theory and the optimization literature; the foundations of such connections have already been explored by e.g. Akian, Quadrat and Viot (1996) and Bernhard (2000), and provide a natural link to pure likelihood decision theory as developed by Cattaneo (2013).
The possibilistic interpretation of likelihood also helps in understanding the representation of ignorance. While probabilistic ignorance is not preserved under arbitrary changes of variables (e.g. non-1-1 transformations), even in the discrete case, possibilistic ignorance is, in the following sense: if we take the maximum likelihood over a set of possibilities, such as {x | T(x) = t} for each t, rather than summing over them, a flat ‘prior likelihood’ (Edwards, 1969, 1992) over x becomes a flat prior likelihood over t. On the other hand, a flat prior probability over x in general becomes non-flat over t under non-1-1 changes of variable. Thus a profile prior likelihood has what, in many cases, may be desirable properties as a representation of prior ignorance (see the discussion in Edwards, 1969, 1992, for more on likelihood and the representation of ignorance). This difference in transformation properties was also illustrated in our simple example comparing the probabilistic and possibilistic analysis of criminal evidence. As noted there, however, the relative probability approach à la Evans (2015) reaches conclusions closer to the possibilistic analysis, compared to the conclusions of the ‘absolute’ probabilistic analysis (Michael Evans, personal communication).
Likelihood is traditionally considered a point function as opposed to a set function; this is also related to controversy over defining likelihood functions for so-called composite hypotheses (see e.g. Edwards, 1992; Royall, 1997). Authors such as Basu (2012) have argued, contra e.g. Fisher, that likelihood could be directly extended to a set function. Basu (2012) further developed the argument that this set function could be taken as additive; we are more inclined, here at least, to accept the first possibility and reject the second. A number of other authors have also considered the question of composite hypotheses, in particular in the context of defining evidence (see e.g. Zhang and Zhang, 2013; Blume, 2013; Bickel, 2012).

We have attempted to avoid the issue of set functions/composite hypotheses somewhat by instead using the concept of a non-1-1 transformation of variables. This allows us to consider the likelihood of subsets of the full/background parameter space based on an indexing statistic, i.e. by using subsets defined via {x | T(x) = t}. This approach is based on what amounts to equality constraints, leaving out subsets defined via inequality constraints. It may be desirable to further relax this and simply consider likelihood directly as a set function defined via

(7.1)  L_p(A) = sup_{x ∈ A} {L(x)} for A ⊆ X.

This allows for inequality constraints such as those in A = {x | T(x) ≤ t}. We leave consideration of this approach to future work. Presumably, however, one could recover the present approach by considering some notion of minimal and/or extremal sets of equality constraints, e.g. by restricting attention to those inequality constraints that are active during the profiling/maximization procedure, and hence are reduced to binding equality constraints. The interpretation of negative log-likelihoods as cost measures may also be helpful here.
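The set-function extension of Eq. (7.1) is a one-liner on a discrete parameter space; the sketch below (Python; the likelihood values and the indexing statistic are made up for illustration) shows that equality constraints recover the profile likelihood while inequality constraints are handled just as directly:

```python
# Set-function likelihood, Eq. (7.1): L_p(A) = sup over x in A of L(x).
L = {1: 0.2, 2: 0.7, 3: 1.0, 4: 0.4}  # illustrative background likelihood
T = lambda x: x * x                    # illustrative indexing statistic

def L_set(A):
    """Maxitive set function induced by the point likelihood L."""
    return max((L[x] for x in A), default=0.0)

# Equality constraint {x | T(x) = 4} recovers the profile likelihood at t = 4:
assert L_set({x for x in L if T(x) == 4}) == 0.7
# Inequality constraint {x | T(x) <= 9} is handled directly:
assert L_set({x for x in L if T(x) <= 9}) == 1.0
# Maxitivity of the induced set function: L_p(A ∪ B) = max{L_p(A), L_p(B)}
A, B = {1, 2}, {3, 4}
assert L_set(A | B) == max(L_set(A), L_set(B))
```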
One of the key issues to consider when deciding whether to accept profile likelihoods as ‘true’ likelihoods is whether they can play the same role that ‘full’ likelihoods play in defining evidential measures (Royall, 1997; Aitkin, 2010; Evans, 2015; Zhang and Zhang, 2013; Blume, 2013; Bickel, 2012). Mathematically, it appears clear that profile likelihood is entirely analogous to marginal probability; it is less clear whether, or under what circumstances, one should use marginal (whether maxitive or additive) measures in defining evidence. We believe that this applies equally to the Bayesian approach. A way forward from here would be to separate the questions: first accept profile likelihood as a ‘marginal’ possibility measure, and then investigate under what circumstances marginal measures can be given further evidential interpretations. We suspect that the answer may require additional concepts and/or assumptions like those used in the causal inference literature to separate spurious marginal associations from ‘true’ causation (Pearl, 2009a,b). That is, we suspect that ‘evidence’ may be better defined in causal terms than in either purely probabilistic or purely possibilistic terms. As such, the question of whether or not profile likelihood is a ‘true’ likelihood should be independent of whether it plays the role of an evidential measure, unless the definition of likelihood is itself explicitly supplemented with causal assumptions.
8. CONCLUSIONS
We have argued that profile likelihood has as much claim to being a true likelihood as a marginal probability has to being a true probability distribution. In the case of marginal probability, integration over variables takes probability distributions to probability distributions, while in the case of likelihood, maximization takes likelihood functions to likelihood functions. Maximization can be considered in this context as an alternative (idempotent) notion of integration, and a likelihood function as a maxitive possibility measure. There are some conflicts with both Bayesian and frequentist considerations, however: lack of additivity and lack of some repeated sampling properties, respectively. In our view, these conflicts are not necessarily an issue, as neither additivity nor repeated sampling properties such as unbiasedness are beyond objection. Instead we argue that the present approach gives a self-consistent theory suitable for possibilistic statistical analysis, with a well-defined method of treating nuisance parameters, and one which continues in the tradition of ‘pure’ likelihood theories. The connection of profile likelihoods to evidential interpretations appears subtle (as is, we believe, the connection of marginal probabilities to evidence); our view is that this issue should be explored further in the context of formulating additional causal properties that an evidence measure should satisfy, such as those required to classify marginal correlations into ‘spurious’ and ‘true’ causal relationships. Finally, taking profile likelihood seriously as a ‘true’ likelihood leads naturally to the idea of ‘Tropical Bayesian Inference’, a subject yet to be properly explored by the statistical community.
ACKNOWLEDGEMENTS
The author would like to thank Michael Evans, Marco Cattaneo, Yudi Pawitan,Alexandre Patriota, Christian Robert and Anthony Edwards for useful commentsand/or discussions.
REFERENCES
Aitkin, M. (2005). Profile likelihood. In Encyclopedia of Biostatistics. John Wiley & Sons, Ltd.
Aitkin, M. (2010). Statistical Inference: An Integrated Bayesian/Likelihood Approach. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. CRC Press.
Akian, M., Quadrat, J. P. and Viot, M. (1996). Duality between probability and optimization. In Idempotency (J. Gunawardena, ed.). Cambridge University Press.
Au, C. and Tam, J. (1999). Transforming variables using the Dirac generalized function. Am. Stat.
Augustin, T., Coolen, F. P. A., de Cooman, G. and Troffaes, M. C. M. (2014). Introduction to Imprecise Probabilities. John Wiley & Sons.
Basu, D. (2012). Statistical Information and Likelihood: A Collection of Critical Essays by Dr. D. Basu. Springer Science & Business Media.
Basu, A., Shioya, H. and Park, C. (2011). Statistical Inference: The Minimum Distance Approach. CRC Press.
Bayarri, M. J., DeGroot, M. H. and Kadane, J. B. (1988). What is the likelihood function? (with discussion). In Statistical Decision Theory and Related Topics IV (S. S. Gupta and J. O. Berger, eds.). Springer, New York.
Bayarri, M. J. and DeGroot, M. H. (1992). Difficulties and ambiguities in the definition of a likelihood function. J. It. Statist. Soc.
Bernhard, P. (2000). Max-plus algebra and mathematical fear in dynamic optimization. Set-Valued Analysis.
Bickel, D. R. (2012). The strength of statistical evidence for composite hypotheses: Inference to the best explanation. Stat. Sin.
Bjørnstad, J. F. (1996). On the generalization of the likelihood function and the likelihood principle. J. Am. Stat. Assoc.
Blume, J. D. (2013). Likelihood and composite hypotheses [Comment on “A Likelihood Paradigm for Clinical Trials”]. J. Stat. Theory Pract.
Cattaneo, M. E. G. V. (2013). Likelihood decision functions. Electron. J. Stat.
Cattaneo, M. E. G. V. (2017). The likelihood interpretation as the foundation of fuzzy set theory. Int. J. Approx. Reason.
Cox, D. R. and Barndorff-Nielsen, O. E. (1994). Inference and Asymptotics. Chapman and Hall, London.
Dubois, D., Moral, S. and Prade, H. (1997). A semantics for possibility theory based on likelihoods. J. Math. Anal. Appl.
Edwards, A. W. F. (1969). Statistical methods in scientific inference. Nature.
Edwards, A. W. F. (1992). Likelihood, expanded ed. Johns Hopkins University Press, Baltimore.
Evans, M. (2015). Measuring Statistical Evidence Using Relative Belief. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. CRC Press.
Halpern, J. Y. (2017). Reasoning about Uncertainty. MIT Press.
Khuri, A. I. (2004). Applications of Dirac’s delta function in statistics. Internat. J. Math. Ed. Sci. Tech.
Kolokoltsov, V. and Maslov, V. P. (1997). Idempotent Analysis and Its Applications. Springer Science & Business Media.
Litvinov, G. L. (2007). Maslov dequantization, idempotent and tropical mathematics: A brief introduction. J. Math. Sci.
Maslov, V. P. (1992). Idempotent Analysis. American Mathematical Soc.
Pachter, L. and Sturmfels, B. (2004). Tropical geometry of statistical models. Proc. Natl. Acad. Sci. U.S.A.
Pawitan, Y. (2001). In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford Science Publications. OUP Oxford.
Pearl, J. (2009a). Causal inference in statistics: An overview. Stat. Surv.
Pearl, J. (2009b). Causality. Cambridge University Press.
Puhalskii, A. (2001). Large Deviations and Idempotent Probability. CRC Press.
Raue, A., Kreutz, C., Maiwald, T., Bachmann, J., Schilling, M., Klingmüller, U. and Timmer, J. (2009). Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics.
Rohde, C. A. (2014). Introductory Statistical Inference with the Likelihood Function. Springer International Publishing.
Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. CRC Press.
Speyer, D. and Sturmfels, B. (2009). Tropical mathematics. Math. Mag.
Tarantola, A. (2006). Elements for Physics: Quantities, Qualities, and Intrinsic Theories. Springer Science & Business Media.
Van Kampen, N. G. (1992). Stochastic Processes in Physics and Chemistry. Elsevier.
Zhang, Z. and Zhang, B. (2013). A likelihood paradigm for clinical trials. J. Stat. Theory Pract.