Likelihood principle
From Academic Kids

In statistics, the likelihood principle is a controversial principle of statistical inference which asserts that all of the information in a sample is contained in the likelihood function.
A likelihood function is a conditional probability distribution considered as a function of its second argument, holding the first fixed. For example, consider a model which gives the probability density function of observable random variable X as a function of a parameter θ. Then for a specific value x of X, the function L(θ | x) = P(X=x | θ) is a likelihood function of θ. Two likelihood functions are equivalent if one is a scalar multiple of the other; according to the likelihood principle, all information from the data relevant to inferences about the value of θ is found in the equivalence class.
Example
Suppose
 X is the number of successes in twelve independent Bernoulli trials with probability θ of success on each trial, and
 Y is the number of independent Bernoulli trials needed to get three successes, again with probability θ of success on each trial.
Then the observation that X = 3 induces the likelihood function
 <math>L(\theta|X=3)=220\;\theta^3(1-\theta)^9</math>
and the observation that Y = 12 induces the likelihood function
 <math>L(\theta|Y=12)=55\;\theta^3(1-\theta)^9.</math>
These are equivalent because each is a scalar multiple of the other. The likelihood principle therefore says the inferences drawn about the value of θ should be the same in both cases.
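The equivalence can be checked numerically. The following is a minimal sketch in Python; the function names are illustrative and not part of any library. It evaluates both likelihood functions at a few values of θ and shows that their ratio is the same constant everywhere.

```python
from math import comb

def binomial_likelihood(theta, successes=3, trials=12):
    """P(X = successes | theta) when the number of trials is fixed in advance."""
    return comb(trials, successes) * theta**successes * (1 - theta)**(trials - successes)

def negative_binomial_likelihood(theta, successes=3, trials=12):
    """P(Y = trials | theta) when trials continue until the last success.

    The final trial must be a success, so only the first trials - 1
    outcomes can be arranged freely.
    """
    return comb(trials - 1, successes - 1) * theta**successes * (1 - theta)**(trials - successes)

thetas = [0.1, 0.25, 0.5, 0.75, 0.9]
ratios = [binomial_likelihood(t) / negative_binomial_likelihood(t) for t in thetas]

for t, r in zip(thetas, ratios):
    print(f"theta={t:.2f}  L_X / L_Y = {r:.6f}")
```

The ratio is 220/55 = 4 for every θ, which is what it means for one likelihood function to be a scalar multiple of the other.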
The difference between observing X = 3 and observing Y = 12 is only in the design of the experiment: in one case, one has decided in advance to try twelve times; in the other, to keep trying until three successes are observed. The outcome is the same in both cases. Therefore the likelihood principle is sometimes stated by saying:
 The inference should depend only on the outcome of the experiment, and not on the design of the experiment.
The law of likelihood
A related concept is the law of likelihood, the notion that the extent to which the evidence supports one parameter value or hypothesis against another is equal to the ratio of their likelihoods. That is,
 <math>\Lambda = {L(a|X=x) \over L(b|X=x)} = {P(X=x|a) \over P(X=x|b)}</math>
is the degree to which the observation x supports parameter value or hypothesis a against b. If this ratio is 1, the evidence is indifferent; if it is greater than 1, the evidence supports a against b; and if it is less than 1, the evidence supports b against a. The use of Bayes factors can extend this by taking account of the complexity of different hypotheses.
Combining the likelihood principle with the law of likelihood yields the consequence that the parameter value which maximizes the likelihood function is the value which is most strongly supported by the evidence. This is the basis for the widely used method of maximum likelihood.
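As an illustration, the likelihood ratio and the maximizing value can be computed for the coin example used in this article (3 successes and 9 failures). This is a sketch only: the grid search and the function names are assumptions made for the example, not a recommended estimation procedure.

```python
def likelihood_kernel(theta, successes=3, failures=9):
    """Likelihood up to a constant factor; the constant cancels in any ratio."""
    return theta**successes * (1 - theta)**failures

def likelihood_ratio(a, b):
    """Degree to which the data support theta = a against theta = b."""
    return likelihood_kernel(a) / likelihood_kernel(b)

# Support for theta = 0.25 against theta = 0.5.
ratio = likelihood_ratio(0.25, 0.5)

# Crude maximum likelihood estimate by grid search over (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
mle = max(grid, key=likelihood_kernel)

print(f"likelihood ratio (0.25 vs 0.5): {ratio:.3f}")
print(f"grid MLE: {mle}")
```

The ratio is about 4.8 in favour of θ = 0.25, and the grid maximum falls at 3/12 = 0.25, the observed proportion of successes.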
Historical remarks
The likelihood principle was first identified by that name in print in 1962 (Barnard et al., Birnbaum, and Savage et al.), but arguments for the same principle, unnamed, and the use of the principle in applications go back to the works of R.A. Fisher in the 1920s. The law of likelihood was identified by that name by I. Hacking (1965). More recently the likelihood principle as a general principle of inference has been championed by Anthony W. F. Edwards. The likelihood principle has been applied to the philosophy of science by R. Royall.
Arguments for and against the likelihood principle
The likelihood principle is not universally accepted. Some widely used methods of conventional statistics, for example many significance tests, are not consistent with the likelihood principle. Let us briefly consider some of the arguments for and against the likelihood principle.
Experimental design arguments on the likelihood principle
Unrealized events do play a role in some common statistical methods. For example, the result of a significance test depends on the probability of a result as extreme or more extreme than the observation, and that probability may depend on the design of the experiment. Thus, to the extent that such methods are accepted, the likelihood principle is denied.
Some classical significance tests are not based on the likelihood. A commonly cited example is the optional stopping problem. Suppose I tell you that I tossed a coin 12 times and in the process observed 3 heads. You might make some inference about the probability of heads and whether the coin was fair. Suppose now I tell you that I tossed the coin until I observed 3 heads, and I tossed it 12 times. Will you now make some different inference?
The likelihood function is the same in both cases: it is proportional to
 <math>p^3 \; (1-p)^9.</math>
According to the likelihood principle, the inference should be the same in either case. But this may seem dubious; it seems possible to argue to a foregone conclusion simply by tossing a coin enough times until a desired result is achieved. Apparently paradoxical results of this kind are considered by some as arguments against the likelihood principle; for others they exemplify its value and resolve the paradox.
Suppose a number of scientists are assessing the probability of a certain outcome (which we shall call 'success') in experimental trials. Conventional wisdom suggests that if there is no bias towards success or failure then the success probability would be one half. Adam, a scientist, conducted 12 trials and obtained 3 successes and 9 failures. Then he dropped dead.
Bill, a colleague in the same lab, continued Adam's work and published Adam's results, along with a significance test. He tested the null hypothesis H0 that p, the success probability, is equal to one half, against the alternative p < 0.5. If H0 is true, the probability of a result at least as extreme as the one observed, that is, 3 or fewer successes out of 12 trials (equivalently, 9 or more failures), is
 <math>\left({12 \choose 9}+{12 \choose 10}+{12 \choose 11}+{12 \choose 12}\right)\left({1 \over 2}\right)^{12}</math>
which is 299/4096 = 7.3%. Thus the null hypothesis is not rejected at the 5% significance level.
Charlotte, another scientist, reads Bill's paper and writes a letter, saying that it is possible that Adam kept trying until he obtained 3 successes, in which case the probability of needing to conduct 12 or more experiments is given by
 <math>1-\left({10 \choose 2}\left({1 \over 2}\right)^{11}+{9 \choose 2}\left({1 \over 2}\right)^{10}+\cdots +{2 \choose 2}\left({1 \over 2}\right)^{3}\right)</math>
which is 134/4096 = 3.27%. Now the result is statistically significant at the 5% level.
To these scientists, whether a result is significant or not seems to depend on the original design of the experiment, not just the likelihood of the outcome.
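Both tail probabilities can be reproduced with exact rational arithmetic. The sketch below uses only the Python standard library; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

half = Fraction(1, 2)

# Bill's analysis: the number of trials (12) was fixed in advance.
# P(X <= 3) for X ~ Binomial(12, 1/2); by symmetry this equals P(X >= 9).
p_fixed = sum(comb(12, k) for k in range(0, 4)) * half**12

# Charlotte's analysis: trials continued until the 3rd success.
# P(Y = n) = C(n - 1, 2) (1/2)^n, so P(Y >= 12) = 1 - sum over n = 3..11.
p_sequential = 1 - sum(comb(n - 1, 2) * half**n for n in range(3, 12))

print(p_fixed, float(p_fixed))
print(p_sequential, float(p_sequential))
```

The first value is 299/4096 (about 7.3%) and the second is 67/2048 = 134/4096 (about 3.27%), matching the figures quoted above: the same data fall on opposite sides of the 5% threshold depending on the assumed stopping rule.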
Bayesian arguments on the likelihood principle
From a Bayesian point of view, the likelihood principle is a direct consequence of Bayes' theorem. An observation A enters the formula,
 <math>P(B|A) = \frac{P(A|B)\;P(B)}{P(A)} = \frac{P(A|B)\;P(B)}{\sum_{B'}P(A|B')\;P(B')}</math>
only through the likelihood function <math>P(A|B)</math>.
In general, observations come into play through the likelihood function, and only through the likelihood function; the information content of the data is entirely expressed by the likelihood function. Furthermore, the likelihood principle implies that any event that did not happen has no effect on an inference, since if an unrealized event does affect an inference then there is some information not contained in the likelihood function. Thus, Bayesians accept the likelihood principle and reject the use of frequentist significance tests. As one leading Bayesian, Harold Jeffreys, described the use of significance tests: "A hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred."
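To illustrate with the coin example, the sketch below computes a discretized posterior for θ under the two designs, using the same prior in both cases. The uniform prior, the grid resolution and the function names are assumptions made for the illustration.

```python
from math import comb

def posterior_on_grid(likelihood, prior, grid):
    """Bayes' theorem on a grid: normalize likelihood times prior."""
    weights = [likelihood(t) * prior(t) for t in grid]
    total = sum(weights)
    return [w / total for w in weights]

def uniform_prior(theta):
    return 1.0

def binomial_likelihood(theta):
    # 3 successes in a fixed 12 trials
    return comb(12, 3) * theta**3 * (1 - theta)**9

def negative_binomial_likelihood(theta):
    # 12 trials needed to reach a fixed 3 successes
    return comb(11, 2) * theta**3 * (1 - theta)**9

# Midpoints of 200 equal-width cells covering (0, 1).
grid = [(i + 0.5) / 200 for i in range(200)]

post_x = posterior_on_grid(binomial_likelihood, uniform_prior, grid)
post_y = posterior_on_grid(negative_binomial_likelihood, uniform_prior, grid)

max_gap = max(abs(a - b) for a, b in zip(post_x, post_y))
print(f"largest difference between the two posteriors: {max_gap:.2e}")
```

The constant factors 220 and 55 cancel in the normalization, so the two posteriors agree up to floating-point rounding.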
Bayesian analysis is not always consistent with the likelihood principle. Jeffreys suggested in 1961 a non-informative prior distribution with density proportional to |I(θ)|^{1/2}, the square root of the determinant of the Fisher information matrix I(θ). This prior, known as the Jeffreys prior, can violate the likelihood principle because the Fisher information, and hence the prior, may depend on the design of the experiment. More dramatically, the use of the Box-Cox transformation may lead to a prior which is data-dependent.
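The coin example shows this design dependence concretely. The sketch below takes the Jeffreys prior as proportional to the square root of the Fisher information and uses the standard Fisher information for the two sampling schemes, stated in the comments. The posteriors are then Beta distributions whose parameters can be read off directly; the helper name is illustrative.

```python
from fractions import Fraction

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

successes, failures = 3, 9
half = Fraction(1, 2)

# Fixed number of trials n: I(theta) = n / (theta (1 - theta)),
# so the prior is proportional to theta^(-1/2) (1 - theta)^(-1/2).
# Posterior: theta^(3 - 1/2) (1 - theta)^(9 - 1/2), i.e. Beta(3.5, 9.5).
binomial_posterior = (successes + half, failures + half)

# Fixed number of successes r: I(theta) = r / (theta^2 (1 - theta)),
# so the (improper) prior is proportional to theta^(-1) (1 - theta)^(-1/2).
# Posterior: theta^(3 - 1) (1 - theta)^(9 - 1/2), i.e. Beta(3, 9.5).
negative_binomial_posterior = (Fraction(successes), failures + half)

mean_fixed_trials = beta_mean(*binomial_posterior)
mean_fixed_successes = beta_mean(*negative_binomial_posterior)

print(float(mean_fixed_trials), float(mean_fixed_successes))
```

The posterior means are 7/26 (about 0.269) and 6/25 = 0.24: the same data and the same likelihood kernel lead to different posteriors because the prior was derived from the design.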
Optional stopping in clinical trials
The fact that Bayesian and frequentist arguments differ on the subject of optional stopping has a major impact on the way that clinical trial data can be analysed. In a frequentist setting there is a major difference between a design which is fixed and one which is sequential, i.e. consisting of a sequence of analyses. Bayesian statistics is inherently sequential and so there is no such distinction.
In a clinical trial it is strictly not valid to conduct an unplanned interim analysis of the data by frequentist methods, whereas this is permissible by Bayesian methods. Similarly, if funding is withdrawn part way through an experiment, and the analyst must work with incomplete data, this is a possible source of bias for classical methods but not for Bayesian methods, which do not depend on the intended design of the experiment. Furthermore, as mentioned above, frequentist analysis is open to unscrupulous manipulation if the experimenter is allowed to choose the stopping point, whereas Bayesian methods are immune to such manipulation.
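The effect of unplanned interim looks on a frequentist test can be seen in a small simulation. The sketch below is illustrative only: the normal-approximation test, the number of tosses, the point of the first look and the fixed random seed are all assumptions chosen for the example. A fair coin is tossed, and a nominal 5% two-sided test is applied either once at the end or after every toss, stopping at the first "significant" result.

```python
import random
from math import sqrt

def z_statistic(heads, n):
    """Normal-approximation test statistic for H0: P(heads) = 1/2."""
    return (heads - n / 2) / (sqrt(n) / 2)

def rejects_with_peeking(rng, max_tosses=200, first_look=10, z_crit=1.96):
    """Test after every toss from first_look on; stop at the first rejection."""
    heads = 0
    for n in range(1, max_tosses + 1):
        heads += rng.random() < 0.5
        if n >= first_look and abs(z_statistic(heads, n)) > z_crit:
            return True
    return False

def rejects_fixed(rng, tosses=200, z_crit=1.96):
    """Test once, after a number of tosses fixed in advance."""
    heads = sum(rng.random() < 0.5 for _ in range(tosses))
    return abs(z_statistic(heads, tosses)) > z_crit

rng = random.Random(0)  # fixed seed so the run is repeatable
runs = 2000
peeking_rate = sum(rejects_with_peeking(rng) for _ in range(runs)) / runs
fixed_rate = sum(rejects_fixed(rng) for _ in range(runs)) / runs

print(f"false rejection rate, single planned test: {fixed_rate:.3f}")
print(f"false rejection rate, test after every toss: {peeking_rate:.3f}")
```

Although the null hypothesis is true in every run, the rejection rate with repeated looks is well above the nominal level, which is why frequentist sequential designs must specify their interim analyses in advance.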
References
 G.A. Barnard, G.M. Jenkins, and C.B. Winsten. "Likelihood Inference and Time Series", J. Royal Statistical Society, series A, 125:321–372, 1962.
 Allan Birnbaum. "On the foundations of statistical inference". J. Amer. Statist. Assoc. 57(298):269–326, 1962. (With discussion.)
 Anthony W.F. Edwards. Likelihood. 1st edition 1972 (Cambridge University Press), 2nd edition 1992 (Johns Hopkins University Press).
 Anthony W.F. Edwards. "The history of likelihood". Int. Statist. Rev. 42:9–15, 1974.
 Ronald A. Fisher. "On the Mathematical Foundations of Theoretical Statistics", Phil. Trans. Royal Soc., series A, 222:326, 1922. (On the web at: http://www.library.adelaide.edu.au/digitised/fisher/18pt1.pdf)
 Ian Hacking. Logic of Statistical Inference. Cambridge University Press, 1965.
 James O. Berger and Robert L. Wolpert. The Likelihood Principle. Institute of Mathematical Statistics, Hayward, CA, 1988.
 Harold Jeffreys. Theory of Probability. Oxford University Press, 1961.
 Richard M. Royall. Statistical Evidence: A Likelihood Paradigm. London: Chapman & Hall, 1997.
 Leonard J. Savage et al. The Foundations of Statistical Inference. 1962.
External links
 Anthony W.F. Edwards. "Likelihood". http://www.cimat.mx/reportes/enlinea/D9910.html
 Jeff Miller. Earliest Known Uses of Some of the Words of Mathematics (L) (http://members.aol.com/jeff570/l.html)
 John Aldrich. Likelihood and Probability in R. A. Fisher’s Statistical Methods for Research Workers (http://www.economics.soton.ac.uk/staff/aldrich/fisherguide/prob+lik.htm)