## Notes to Inductive Logic

1.
Although enumerative inductive arguments may seem similar to what
classical statisticians call *estimation*, they are not really
the same thing. As classical statisticians are quick to point out,
*estimation* does not use the sample to *inductively
support* a conclusion about the whole population.
*Estimation* is not supposed to be a kind of inductive
inference. Rather, *estimation* is a decision strategy. The
sample frequency will be within two standard deviations of the
population frequency in about 95% of all samples. So, if one adopts
the strategy of *accepting as true* the claim that the
population frequency is within two standard deviations of the sample
frequency, and if one uses this strategy repeatedly for various
samples, one should be right about 95% of the time. I discuss
enumerative induction in much more detail in the
supplement on Enumerative Inductions: Bayesian Estimation and Convergence,
which treats Bayesian convergence and the satisfaction of the CoA for
the special case of enumerative inductions.
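
As a rough illustration of the decision-strategy reading, the following Python sketch simulates repeated sampling and counts how often the sample frequency falls within two standard deviations of the population frequency. The population frequency `p`, the sample size `n`, the number of trials, and the seed are hypothetical choices made only for this example, and the standard deviation is computed from the true `p` (which in practice would itself have to be estimated).

```python
import random


def coverage_of_two_sigma_rule(p, n, trials, seed=0):
    """Fraction of simulated samples whose frequency lies within 2 SDs of p."""
    rng = random.Random(seed)
    sd = (p * (1 - p) / n) ** 0.5  # standard deviation of the sample frequency
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n and count the "successes".
        successes = sum(rng.random() < p for _ in range(n))
        if abs(successes / n - p) <= 2 * sd:
            hits += 1
    return hits / trials


# Hypothetical population frequency and sample size.
coverage = coverage_of_two_sigma_rule(p=0.3, n=400, trials=2000)
print(f"share of samples within two standard deviations: {coverage:.3f}")
```

Under these assumptions the printed share should come out close to the 95% figure cited above, which is a fact about the long-run reliability of the strategy and not a degree of support for any single interval.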

2. Here are the more usual axioms for conditional probability functions. These axioms are provably equivalent to the axioms presented in the main text.

A support function is a function \(P_{\alpha}\) from pairs of sentences of *L* to real numbers between 0 and 1 (inclusive) that satisfies the following axioms:

- (1) \(P_{\alpha}[D \pmid E] \ne 1\) for at least one pair of sentences *D* and *E*.

For all sentences *A*, *B*, and *C*,

- (2) If \(B \vDash A\), then \(P_{\alpha}[A \pmid B] = 1\);
- (3) If \(B \vDash C\) and \(C \vDash B\), then \(P_{\alpha}[A \pmid B] = P_{\alpha}[A \pmid C]\);
- (4) If \(C \vDash{\nsim}(B\cdot A)\), then either \[P_{\alpha}[(A \vee B) \pmid C] = P_{\alpha}[A \pmid C] + P_{\alpha}[B \pmid C]\] or else \(P_{\alpha}[D \pmid C] = 1\) for every *D*;
- (5) \(P_{\alpha}[(A\cdot B) \pmid C] = P_{\alpha}[A \pmid (B\cdot C)] \times P_{\alpha}[B \pmid C]\).

The results stated in the main text derive from the axioms in the main text as follows:

- (1) If \(B \vDash A\), then \(P_{\alpha}[A \pmid B] = 1\).

  **Proof:** First notice that (by axiom 2) all logical entailments must have the same support value: for, when \(B \vDash A\) and \(D \vDash C\) we have \(P_{\alpha}[A \pmid B] \ge P_{\alpha}[C \pmid D] \ge P_{\alpha}[A \pmid B]\); so \(P_{\alpha}[C \pmid D] = P_{\alpha}[A \pmid B]\). Thus, all logical entailments \(B \vDash A\) must have the same real number support value *r*: \(P_{\alpha}[A \pmid B] = r\) whenever \(B \vDash A\). To see that *r* must equal either 1 or 0, use axiom 5 as follows (employing the various logical entailments involved):

  \[r = P_{\alpha}[A \pmid A] = P_{\alpha}[(A\cdot A) \pmid A] = P_{\alpha}[A \pmid (A\cdot A)] \times P_{\alpha}[A \pmid A] = r \times r;\]

  so \(r = r \times r\); so \(r = 1\) or \(r = 0\). To see that \(r \ne 0\), let us suppose \(r = 0\) and derive a contradiction. Axiom 1 together with axiom 2 requires that for some *A* and *B*, \(P_{\alpha}[A \pmid B] \lt P_{\alpha}[A \pmid A] = r = 0\). Then, for this *A* and *B*, since \(B \vDash(A \vee{\nsim}A)\) and \(B \vDash{\nsim}(A \cdot{\nsim}A)\), from axiom 4 we have

  \[0 = r = P_{\alpha}[(A \vee{\nsim}A) \pmid B] = P_{\alpha}[A \pmid B] + P_{\alpha}[{\nsim}A \pmid B]\]

  (or else \(P_{\alpha}[C \pmid B] = P_{\alpha}[B \pmid B] = 0\) for every sentence *C*, which contradicts our supposition that \(P_{\alpha}[A \pmid B] \lt 0\)). But then we must have \(P_{\alpha}[{\nsim}A \pmid B] \gt 0 = r = P_{\alpha}[A \pmid A]\), which contradicts axiom 2.

- (2) If \(C \vDash{\nsim}(B\cdot A)\), then either

  \[P_{\alpha}[(A \vee B) \pmid C] = P_{\alpha}[A \pmid C] + P_{\alpha}[B \pmid C]\]

  or else \(P_{\alpha}[E \pmid C] = 1\) for every sentence *E*.

  **Proof:** Since \(E \vDash E\), \(P_{\alpha}[E \pmid E] = 1\). Thus, axiom 4 becomes: If \(C \vDash{\nsim}(B\cdot A)\), then either

  \[P_{\alpha}[(A \vee B) \pmid C] = P_{\alpha}[A \pmid C] + P_{\alpha}[B \pmid C]\]

  or else \(P_{\alpha}[E \pmid C] = 1\) for every sentence *E*.

- (3) \(P_{\alpha}[{\nsim}A \pmid B] = 1 - P_{\alpha}[A \pmid B]\) or else \(P_{\alpha}[C \pmid B] = 1\) for every sentence *C*.

  **Proof:** \(B \vDash(A \vee{\nsim}A)\) and \(B \vDash{\nsim}(A \cdot{\nsim}A)\), so

  \[1 = P_{\alpha}[(A \vee{\nsim}A) \pmid B] = P_{\alpha}[A \pmid B] + P_{\alpha}[{\nsim}A \pmid B]\]

  or else \(P_{\alpha}[C \pmid B] = 1\) for every sentence *C*. Thus, \(P_{\alpha}[{\nsim}A \pmid B] = 1 - P_{\alpha}[A \pmid B]\) or else \(P_{\alpha}[C \pmid B] = 1\) for every sentence *C*.

- (4) \(1 \ge P_{\alpha}[A \pmid B] \ge 0\).

  **Proof:** Result 1 (above) together with axiom 2 implies that \(1 \ge P_{\alpha}[A \pmid B]\), for all \(A, B\). Suppose that for some *A* and *B*, \(P_{\alpha}[A \pmid B] \lt 0\); then (by result 3, since not every sentence has support 1 on *B*)

  \[P_{\alpha}[{\nsim}A \pmid B] = 1 - P_{\alpha}[A \pmid B] \gt 1;\]

  contradiction! So we must have \(P_{\alpha}[A \pmid B] \ge 0\). Thus, for all sentences *A* and *B*, \(1 \ge P_{\alpha}[A \pmid B] \ge 0\).

- (5) If \(B \vDash A\), then \(P_{\alpha}[A \pmid C] \ge P_{\alpha}[B \pmid C]\).

  **Proof:** Suppose \(B \vDash A\). Then \(\vDash{\nsim}(B \cdot{\nsim}A)\), so

  \[1 \ge P_{\alpha}[(B \vee{\nsim}A) \pmid C] = P_{\alpha}[B \pmid C] + P_{\alpha}[{\nsim}A \pmid C] = P_{\alpha}[B \pmid C] + 1 - P_{\alpha}[A \pmid C]\]

  (or else \(P_{\alpha}[D \pmid C] = 1\) for every sentence *D*). Thus, \(P_{\alpha}[A \pmid C] \ge P_{\alpha}[B \pmid C]\).

- (6) If \(B \vDash A\) and \(A \vDash B\), then \(P_{\alpha}[A \pmid C] = P_{\alpha}[B \pmid C]\).

  **Proof:** Suppose \(B \vDash A\) and \(A \vDash B\). Then, applying the previous result in both directions, we have \(P_{\alpha}[A \pmid C] = P_{\alpha}[B \pmid C]\).

- (7) If \(C \vDash B\), then \(P_{\alpha}[(A\cdot B) \pmid C] = P_{\alpha}[(B\cdot A) \pmid C] = P_{\alpha}[A \pmid C]\).

  **Proof:** Suppose \(C \vDash B\). Then, \((A\cdot C) \vDash B\), so \(P_{\alpha}[B \pmid (A\cdot C)] = 1\), by result 1 above. Then, by axiom 5,

  \[P_{\alpha}[(B\cdot A) \pmid C] = P_{\alpha}[B \pmid (A\cdot C)] \times P_{\alpha}[A \pmid C] = P_{\alpha}[A \pmid C].\]

  Thus, since \((A\cdot B) \vDash(B\cdot A)\) and \((B\cdot A) \vDash(A\cdot B)\), result 6 above yields

  \[\begin{align} P_{\alpha}[(A\cdot B) \pmid C] & = P_{\alpha}[(B\cdot A) \pmid C] \\ &= P_{\alpha}[A \pmid C].\end{align}\]

- (8) If \(C \vDash B\) and \(B \vDash C\), then \(P_{\alpha}[A \pmid B] = P_{\alpha}[A \pmid C]\).

  **Proof:** Suppose \(C \vDash B\) and \(B \vDash C\). Then, from previous results together with axioms 5 and 3 we have

  \[\begin{align} P_{\alpha}[A \pmid B] & = P_{\alpha}[(A\cdot C) \pmid B]\\ & = P_{\alpha}[A \pmid (C\cdot B)] \times P_{\alpha}[C \pmid B]\\ & = P_{\alpha}[A \pmid (C\cdot B)]\\ & = P_{\alpha}[A \pmid (B\cdot C)]\\ & = P_{\alpha}[A \pmid (B\cdot C)] \times P_{\alpha}[B \pmid C]\\ & = P_{\alpha}[(A\cdot B) \pmid C]\\ & = P_{\alpha}[A \pmid C]. \end{align}\]

- (9) If \(P_{\alpha}[B \pmid C] \gt 0\), then

  \[ P_{\alpha}[A \pmid (B\cdot C)] = P_{\alpha}[B \pmid (A\cdot C)] \times \frac{P_{\alpha}[A \pmid C]}{P_{\alpha}[B \pmid C]} \]

  (simple form of Bayes’ theorem).

  **Proof:** Axiom 5 together with result 6 yields

  \[P_{\alpha}[A \pmid (B\cdot C)] \times P_{\alpha}[B \pmid C] = P_{\alpha}[(A\cdot B) \pmid C] = P_{\alpha}[(B\cdot A) \pmid C] = P_{\alpha}[B \pmid (A\cdot C)] \times P_{\alpha}[A \pmid C].\]

  Thus, if \(P_{\alpha}[B \pmid C] \gt 0\), then

  \[ P_{\alpha}[A \pmid (B\cdot C)] = P_{\alpha}[B \pmid (A\cdot C)] \times \frac{P_{\alpha}[A \pmid C]}{P_{\alpha}[B \pmid C]}. \]

- (10) \(P_{\alpha}[(A\vee B) \pmid C] = P_{\alpha}[A \pmid C] + P_{\alpha}[B \pmid C] - P_{\alpha}[(A\cdot B) \pmid C]\).

  **Proof:** If \(P_{\alpha}[D \pmid C] = 1\) for every *D*, then each term in the result is 1, so the result holds. So, let’s suppose that \(P_{\alpha}[D \pmid C] \ne 1\) for some *D*. Notice that

  \[(A \vee B) \vDash(A \vee({\nsim}A \cdot B))\]

  and

  \[(A \vee({\nsim}A \cdot B)) \vDash(A \vee B),\]

  and also

  \[C \vDash{\nsim}(A \cdot({\nsim}A \cdot B)).\]

  We’ll also be using the fact that

  \[B \vDash((A \cdot B)\vee({\nsim}A \cdot B))\]

  and

  \[((A \cdot B)\vee({\nsim}A \cdot B)) \vDash B,\]

  and also

  \[C \vDash{\nsim}((A \cdot B)\cdot({\nsim}A \cdot B)).\]

  Then

  \[ \begin{align} P_{\alpha}[(A \vee B) \pmid C] & = P_{\alpha}[(A \vee({\nsim}A\cdot B)) \pmid C]\\ & = P_{\alpha}[A \pmid C] + P_{\alpha}[({\nsim}A\cdot B) \pmid C]\\ & = P_{\alpha}[A \pmid C] + P_{\alpha}[({\nsim}A\cdot B) \pmid C] + P_{\alpha}[B \pmid C] \\ &\qquad - P_{\alpha}[B \pmid C]\\ & = P_{\alpha}[A \pmid C] + P_{\alpha}[({\nsim}A\cdot B) \pmid C] + P_{\alpha}[B \pmid C] \\ &\qquad - P_{\alpha}[((A\cdot B)\vee({\nsim}A\cdot B)) \pmid C]\\ & = P_{\alpha}[A \pmid C] + P_{\alpha}[({\nsim}A\cdot B) \pmid C] + P_{\alpha}[B \pmid C] \\ &\qquad - P_{\alpha}[(A\cdot B) \pmid C] - P_{\alpha}[({\nsim}A\cdot B) \pmid C]\\ & = P_{\alpha}[A \pmid C] + P_{\alpha}[B \pmid C] - P_{\alpha}[(A\cdot B) \pmid C]. \end{align} \]

- (11) If \(\{B_1 , \ldots ,B_n\}\) is any finite set of sentences such that for each pair \(B_i\) and \(B_j\), \(C \vDash{\nsim}(B_{i}\cdot B_{j})\) (i.e., the members of the set are mutually exclusive, given *C*), then either \(P_{\alpha}[D \pmid C] = 1\) for every sentence *D*, or

  \[ P_{\alpha}[((B_1\vee B_2)\vee \ldots \vee B_n) \pmid C] = \sum^{n}_{i=1} P_{\alpha}[B_i \pmid C]. \]

  **Proof:** For each distinct *i* and *j*, let \(C \vDash{\nsim}(B_{i}\cdot B_{j})\); and suppose that \(P_{\alpha}[D \pmid C] \lt 1\) for at least one sentence *D*. First notice that we have, for each *i* greater than 1 and less than *n*,

  \[C \vDash({\nsim}(B_1\cdot B_{i+1})\cdot \ldots \cdot{\nsim}(B_i\cdot B_{i+1}));\]

  so

  \[C \vDash{\nsim}(((B_1\vee B_2)\vee \ldots \vee B_i)\cdot B_{i+1}).\]

  Then, for any finite list of the first *n* of the \(B_i\) (for each value of *n*),

  \[ \begin{align} P_{\alpha}&[(((B_1\vee B_2)\vee \ldots \vee B_{n-1})\vee B_n) \pmid C]\\ &\qquad = P_{\alpha}[((B_1\vee B_2)\vee \ldots \vee B_{n-1}) \pmid C] + P_{\alpha}[B_n \pmid C] \\ &\qquad = \ldots \\ &\qquad = \sum^{n}_{i=1} P_{\alpha}[B_i \pmid C]. \end{align} \]
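
To make the derived results concrete, here is a minimal Python sketch that checks a few of them by brute force on one small finite model. Everything in the model is an assumption introduced for illustration: sentences are represented as sets of three hypothetical "possible worlds", entailment \(B \vDash A\) is modelled as set inclusion, the world weights are arbitrary, and the convention that every sentence gets support 1 on a zero-weight condition mirrors the "or else" clause of the additivity axiom. Passing these checks shows only that this particular model behaves as the results say; it is not a proof.

```python
from fractions import Fraction
from itertools import combinations, product

# Hypothetical three-world model with arbitrary positive weights summing to 1.
WORLDS = (0, 1, 2)
WEIGHT = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}

# Every sentence is identified with the set of worlds in which it is true.
PROPS = [frozenset(c) for k in range(len(WORLDS) + 1)
         for c in combinations(WORLDS, k)]


def weight(prop):
    return sum((WEIGHT[w] for w in prop), Fraction(0))


def support(a, b):
    """P[a | b]; on a zero-weight condition every sentence gets support 1."""
    wb = weight(b)
    return weight(a & b) / wb if wb > 0 else Fraction(1)


def check_results():
    for a, b in product(PROPS, repeat=2):
        if b <= a:  # b entails a (result 1)
            assert support(a, b) == 1
    for a, b, c in product(PROPS, repeat=3):
        # Axiom 5: P[(A.B) | C] = P[A | (B.C)] x P[B | C]
        assert support(a & b, c) == support(a, b & c) * support(b, c)
        # Result 10: general additivity
        assert support(a | b, c) == (support(a, c) + support(b, c)
                                     - support(a & b, c))
        # Result 9: simple form of Bayes' theorem
        if support(b, c) > 0:
            assert support(a, b & c) == (support(b, a & c) * support(a, c)
                                         / support(b, c))
    return True


print(check_results())
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the equalities are tested exactly and not up to floating-point error.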

3.
This result is not the rule commonly known as *countable
additivity*. Countable additivity requires a language in which
infinite disjunctions are defined.

The present result follows directly from the previous result (without
appealing to *countable additivity*), since, by definition

\[\sum^{\infty}_{i=1} P_{\alpha}[B_i \pmid C] = \lim_n \sum^{n}_{i=1} P_{\alpha}[B_i \pmid C].\]

Given a language in which infinite disjunction is defined, countable additivity would then result from the following rule (or axiom):

\[P_{\alpha}[((B_1\vee B_2)\vee \ldots) \pmid C] = \lim_n P_{\alpha}[((B_1\vee B_2)\vee \ldots \vee B_n) \pmid C]. \]
4.
Here are the usual axioms when *unconditional probability* is
taken as basic:

\(P_{\alpha}\) is a function from statements to real numbers between 0 and 1 that satisfies the following rules:

- if \(\vDash A\) (i.e., if *A* is a logical truth), then \(P_{\alpha}[A] = 1\);
- if \(\vDash{\nsim}(A\cdot B)\) (i.e., if *A* and *B* are logically incompatible), then \(P_{\alpha}[(A\vee B)] = P_{\alpha}[A] + P_{\alpha}[B]\);

Definition: if \(P_{\alpha}[B] \gt 0\), then

\[P_{\alpha}[A \pmid B] = \frac{P_{\alpha}[(A\cdot B)]}{P_{\alpha}[B]}.\]
5.
Bayesians often refer to the probability of an evidence statement on
a hypothesis, \(P[e \pmid h\cdot b\cdot c]\), as the *likelihood of
the hypothesis*. This can be a somewhat confusing convention since
it is clearly the evidence that is made likely to whatever degree by
the hypothesis. So, I will disregard the usual convention here. Also,
presentations of probabilistic inductive logic often suppress *c*
and *b*, and simply write ‘\(P[e \pmid h]\)’. But
*c* and *b* are important parts of the logic of the
likelihoods. So I will continue to make them explicit.

6. These attempts have not been wholly satisfactory thus far, but research continues. For an illuminating discussion of the logic of direct inference and the difficulties involved in providing a formal account, see the series of papers: Levi 1977; Kyburg 1978; Levi 1978. Levi 1980 develops a very sophisticated approach.

Kyburg has developed a logic of statistical inference based solely on logical direct inference probabilities (Kyburg 1974). Kyburg’s logical probabilities do not satisfy the usual axioms of probability theory. The series of papers cited above compares Kyburg’s approach to a kind of Bayesian inductive logic championed by Levi (e.g., in Levi 1967).

7.
This idea should not be confused with *logical positivism* or
*logical empiricism*. A version of *logical positivism*
applied to likelihoods would hold that if two theories assign the same
likelihood values to all possible evidence claims, then they are
essentially the same theory, though they may be couched in different
words. In short: *same likelihoods* implies *same
theory*. The view suggested here, however, is not
*positivism*, but its inverse, which should be much less
controversial: *different likelihoods* implies *different
theories*. That is, given that all of the relevant background and
auxiliaries are made explicit (represented in ‘*b*’),
if two scientists disagree significantly about the likelihoods of
important evidence claims on a given hypothesis, they must understand
the empirical content of that hypothesis quite differently. To that
extent, though they may employ the same syntactic expressions, they
use them to express empirically distinct hypotheses.

8.
Call an object *grue* at a given time *just in case*
either the time is earlier than the first second of the year 2030 and
the object is green or the time is not earlier than the first second
of 2030 and the object is blue. Now the statement ‘All emeralds
are green (at all times)’ has the same syntactic structure as
‘All emeralds are grue (at all times)’. So, if syntactic
structure determines priors, then these two hypotheses should have the
same prior probabilities. Indeed, both should have prior probabilities
approaching 0. For, there are an infinite number of competitors of
these two hypotheses, each sharing the same syntactic structure:
consider the hypotheses ‘All emeralds are grue\(_n\) (at all
times)’, where an object is grue\(_n\) at a given time *just
in case* either the time is earlier than the first second of the
\(n^{\text{th}}\) day after January 1, 2030, and the object is
green *or* the time is not earlier than the first second of the
\(n^{\text{th}}\) day after January 1, 2030, and the object is
blue. A purely syntactic specification of the priors should assign all
of these hypotheses the same prior probability. But these are mutually
exclusive hypotheses; so their prior probabilities must sum to a value
no greater than 1. The only way this can happen is for ‘All
emeralds are green’ and each of its grue\(_n\) competitors to
have prior probability values either equal to 0 or infinitesimally
close to it.

9.
This assumption may be substantially relaxed without affecting the
analysis below; we might instead only suppose that the ratios
\(P_{\alpha}[c^n \pmid h_j\cdot b]/P_{\alpha}[c^n \pmid h_i\cdot b]\)
are bounded so as not to get exceptionally far from 1. If
*that* supposition were to fail, then the mere occurrence of
the experimental conditions would count as very strong evidence for or
against hypotheses—a highly implausible effect. Our analysis
could include such bounded condition-ratios, but this would only add
inessential complexity to our treatment.

10. For example, when a new disease is discovered, a new hypothesis \(h_{u+1}\) about that disease being a possible cause of patients’ symptoms is made explicit. The old catch-all was, “the symptoms are caused by some unknown disease—some disease other than \(h_1 ,\ldots ,h_u\)”. So the new catch-all hypothesis must now state that “the symptoms are caused by one of the remaining unknown diseases—some disease other than \(h_1 ,\ldots ,h_u, h_{u+1}\)”. And, clearly,

\[ \begin{align} P_{\alpha}[h_K \pmid b] & = P_{\alpha}[{\nsim}h_1\cdot \ldots \cdot{\nsim}h_u \pmid b]\\ & = P_{\alpha}[{\nsim}h_1\cdot \ldots \cdot{\nsim}h_u\cdot(h_{u+1}\vee{\nsim}h_{u+1}) \pmid b]\\ & = P_{\alpha}[{\nsim}h_1\cdot \ldots \cdot{\nsim}h_{u}\cdot{\nsim}h_{u+1} \pmid b] + P_{\alpha}[h_{u+1} \pmid b]\\ & = P_{\alpha}[h_{K*} \pmid b] + P_{\alpha}[h_{u+1} \pmid b]. \end{align} \]

Thus, the new hypothesis \(h_{u+1}\) is “peeled off” of the old catch-all hypothesis \(h_K\), leaving a new catch-all hypothesis \(h_{K*}\) with a prior probability value equal to that of the old catch-all minus the prior of the new hypothesis.
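
The bookkeeping involved can be sketched in a few lines of Python. The prior values and the helper name `peel_off` are hypothetical and chosen only to illustrate the arithmetic of the derivation above; nothing here says how the prior of the new hypothesis should be assessed.

```python
from fractions import Fraction


def peel_off(priors, catch_all, new_prior):
    """Return (priors with the new hypothesis appended, new catch-all prior)."""
    if not 0 <= new_prior <= catch_all:
        raise ValueError("the new hypothesis must fit inside the old catch-all")
    return priors + [new_prior], catch_all - new_prior


# Hypothetical priors for the explicit hypotheses h_1 and h_2.
priors = [Fraction(1, 2), Fraction(3, 10)]
catch_all = 1 - sum(priors)  # prior of the old catch-all h_K

# A newly articulated hypothesis h_3 with a hypothetical prior of 1/20.
new_priors, new_catch_all = peel_off(priors, catch_all, Fraction(1, 20))
print(new_priors, new_catch_all)
```

The priors of the explicit hypotheses are left untouched; only the catch-all shrinks, so the total still sums to 1.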

11. This claim depends, of course, on \(h_i\) being evidentially distinct from each alternative \(h_j\). I.e., there must be conditions \(c_k\) with possible outcomes \(o_{ku}\) on which the likelihoods differ:

\[ P[o_{ku} \pmid h_{i}\cdot b\cdot c_{k}] \ne P[o_{ku} \pmid h_{j}\cdot b\cdot c_{k}]. \]
Otherwise \(h_i\) and \(h_j\) are empirically equivalent, and no
amount of evidence can support one over the other. (Did you think a
confirmation theory could possibly do better?—could somehow
employ evidence to confirm the true hypothesis over *evidentially
equivalent* rivals?) If the true hypothesis has evidentially
equivalent rivals, then the convergence result implies that the odds
against *the disjunction* of the true hypothesis with these
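rivals very probably go to 0, so the posterior probability of this
*disjunction* goes to 1. Among evidentially equivalent
hypotheses the ratio of their posterior probabilities equals the ratio
of their priors:

\[\frac{P_{\alpha}[h_j \pmid b\cdot c^n\cdot e^n]}{P_{\alpha}[h_i \pmid b\cdot c^n\cdot e^n]} = \frac{P_{\alpha}[h_j \pmid b]}{P_{\alpha}[h_i \pmid b]}.\]
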
So the true hypothesis will have a posterior probability near 1 (after
evidence drives the posteriors of evidentially distinguishable rivals
near to 0) *just in case* plausibility arguments and
considerations (expressed in *b*) make each evidentially
indistinguishable rival so much less plausible by comparison that the
sum of each of their comparative plausibilities (as compared to the
true hypothesis) remains very small.

One more comment about this. It is tempting to identify *evidential
distinguishability* (via the *evidential likelihoods*) with
*empirical distinguishability*. But many plausibility arguments
in the sciences, such as *thought experiments*, draw on broadly
empirical considerations, on what we know or strongly suspect about
how the world works based on our experience of the world. Although
this kind of “evidence” may not be representable via
*evidential likelihoods* (because the hypotheses it bears on
don’t deductively or probabilistically imply it), it often plays
an important role in scientific assessments of hypotheses—in
assessments of whether a hypothesis is so extraordinary that only
really extraordinary likelihood evidence could rescue it. It is
(arguably) a distinct virtue of the Bayesian logic of evidential
support that it permits such considerations to be figured into the net
support for hypotheses.

12.
This is a good place to describe one reason for thinking that
*inductive support functions* must be distinct from
subjectivist or personalist *degree-of-belief functions*.
Although likelihoods have a high degree of objectivity in many
scientific contexts, it is difficult for *belief functions* to
properly represent objective likelihoods. This is an aspect of the
*problem of old evidence*.

*Belief functions* are supposed to provide an idealized model
of belief strengths for agents. They extend the notion of ideally
consistent belief to a probabilistic notion of ideally coherent belief
strengths. There is no harm in this kind of idealization. It is
supposed to supply a normative guide for real decision making. An
agent is supposed to make decisions based on her belief-strengths
about the state of the world, her belief strengths about possible
consequences of actions, and her assessment of the desirability (or
*utility*) of these consequences. But the very role that
*belief functions* are supposed to play in decision making
makes them ill-suited to inductive inferences where the
*likelihoods* are often supposed to be objective, or at least
possess inter-subjectively agreed values that represent the empirical
import of hypotheses. For the purposes of decision making,
degree-of-belief functions *should* represent the agent’s
belief strengths *based on everything she presently knows*. So,
degree-of-belief likelihoods must represent how strongly the agent
would believe the evidence if the hypothesis were added to
*everything else she presently knows*. However,
support-function likelihoods are supposed to represent what the
hypothesis (together with explicit background and experimental
conditions) *says* or *implies* about the evidence. As a
result, *degree-of-belief* likelihoods are saddled with a
version of the *problem of old evidence*, a problem not shared
by support function likelihoods. Furthermore, it turns out that the
old evidence problem for likelihoods is much worse than is usually
recognized.

Here is the problem. If the agent is already certain of an evidence
statement *e*, then her *belief-function* likelihoods for
that statement must be 1 on every hypothesis. I.e., if \(Q_{\gamma}\)
is her *belief function* and \(Q_{\gamma}[e] = 1\), then it
follows from the axioms of probability theory that \(Q_{\gamma}[e
\pmid h_i\cdot b\cdot c] = 1\), regardless of what \(h_i\)
says—even if \(h_i\) implies that *e* is quite unlikely
(given \(b\cdot c)\). But the problem goes even deeper. It not only
applies to evidence that the agent *knows with certainty*. It
turns out that almost anything the agent learns that can change how
strongly she believes *e* will also influence the value of her
*belief-function* likelihood for *e*, because
\(Q_{\gamma}[e \pmid h_i\cdot b\cdot c]\) represents the agent’s
belief strength given *everything she knows*.

To see the difficulty with less-than-certain evidence, consider the
following example. Let *e* be any statement that is statistically
implied to degree *r* by a hypothesis *h* together with
experimental conditions *c* (e.g., *e* says “the coin
lands *heads* on the next toss” and \(h\cdot c\) says
“the coin is fair and is tossed in the usual way on the next
toss”). Then the correct objective likelihood value is just
\(P[e \pmid h\cdot c] = r\) (e.g., for \(r = 1/2)\). Let *d* be a
statement that is intuitively not relevant in any way to how likely
*e* should be on \(h\cdot c\) (e.g., let *d* say “Jim
will be really pleased with the outcome of that next toss”).
Suppose some rational agent has a degree-of-belief function *Q*
for which the likelihood for *e* due to \(h\cdot c\) agrees with
the objective value: \(Q[e \pmid h\cdot c] = r\) (e.g., with \(r =
1/2)\).

Our analysis will show that this agent’s belief-strength for
*d* given \({\nsim}e\cdot h\cdot c\) will be a relevant factor;
so suppose that her degree-of-belief in that regard has any value
*s* other than 1: \(Q[d \pmid {\nsim}e\cdot h\cdot c] = s \lt 1\)
(e.g., suppose \(s = 1/2)\). This is a very weak supposition. It only
says that adding \({\nsim}e\cdot h\cdot c\) to everything else the
agent currently knows leaves her less than certain that *d* is
true.

Now, suppose this agent learns the following bit of new information in
a completely convincing way (e.g., I seriously tell her so, and she
believes me completely): \((d\vee e)\) (i.e., Jim will be really
pleased with the outcome of the next toss unless it comes up
*heads*).

Thus, on the usual Bayesian degree-of-belief account the agent is
supposed to update her belief function *Q* to arrive at a new
belief function \(Q_{\textit{new}}\) by the updating rule:

\(Q_{\textit{new}}[S] = Q[S \pmid (d\vee e)]\), for each statement *S*.

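However, this update of the agent’s belief function *has to
screw up* the objectivity of her new belief-function likelihood
for *e* on \(h\cdot c\), because she now should have:

\[ \begin{align} Q_{\textit{new}}[e \pmid h\cdot c] & = Q[e \pmid h\cdot c\cdot(d\vee e)]\\ & = \frac{Q[(e\cdot(d\vee e)) \pmid h\cdot c]}{Q[(d\vee e) \pmid h\cdot c]}\\ & = \frac{Q[e \pmid h\cdot c]}{Q[e \pmid h\cdot c] + Q[({\nsim}e\cdot d) \pmid h\cdot c]}\\ & = \frac{Q[e \pmid h\cdot c]}{Q[e \pmid h\cdot c] + Q[d \pmid {\nsim}e\cdot h\cdot c] \times Q[{\nsim}e \pmid h\cdot c]}\\ & = \frac{r}{r + s\times(1 - r)}. \end{align} \]

Thus, the updated belief function likelihood must have value

\[Q_{\textit{new}}[e \pmid h\cdot c] = \frac{1}{1 + s\times \frac{(1- r)}{r}}.\]
This value can be equal to the correct likelihood value *r* just
in case \(s = 1\). For example, for \(r =
1/2\) and \(s = 1/2\) we get \(Q_{\textit{new}}[e \pmid h\cdot c] = 2/3\).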
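
The effect of the update can be checked numerically. The following Python sketch builds a small joint distribution over *e* and *d* (relative to \(h\cdot c\)), conditions it on \((d\vee e)\), and reads off the new likelihood for *e*. The function name and the extra parameter `t`, standing for \(Q[d \pmid e\cdot h\cdot c]\), are assumptions introduced for the example; the result does not depend on `t`.

```python
from fractions import Fraction


def updated_likelihood(r, s, t=Fraction(1, 2)):
    """Q_new[e | h.c] after conditioning the belief function on (d or e).

    r = Q[e | h.c], s = Q[d | ~e.h.c], t = Q[d | e.h.c] (illustrative only).
    """
    # Joint distribution over the truth values of (e, d), given h.c.
    joint = {
        (True, True): r * t,
        (True, False): r * (1 - t),
        (False, True): (1 - r) * s,
        (False, False): (1 - r) * (1 - s),
    }
    # Keep only the cases compatible with the learned disjunction (d or e).
    learned = {ed: q for ed, q in joint.items() if ed[0] or ed[1]}
    total = sum(learned.values())
    return sum(q for (e, _), q in learned.items() if e) / total


print(updated_likelihood(Fraction(1, 2), Fraction(1, 2)))  # the example in the text
print(updated_likelihood(Fraction(1, 2), Fraction(1)))     # s = 1 leaves r intact
```

With \(r = 1/2\) and \(s = 1/2\) the function returns 2/3, and with \(s = 1\) it returns *r* itself, in line with the formula.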

The point is that even the most trivial knowledge of disjunctive
claims involving *e* may completely upset the value of the
likelihood for an agent’s belief function. And an agent will
almost always have some such trivial knowledge. Updating on such
conditionals can force the agent’s *belief functions* to
deviate widely from the evidentially relevant objective values of
likelihoods on which scientific hypotheses should be tested.

More generally, it can be shown that the incorporation into a belief
function *Q* of almost any kind of evidence for or against the
truth of a prospective evidence claim *e*—even uncertain
evidence for *e*, as may come through Jeffrey
updating—completely undermines the objective or
inter-subjectively agreed likelihoods that a belief function might
have expressed before updating. This should be no surprise. The
agent’s belief function likelihoods reflect her *total
degree-of-belief* in *e*, based on a hypothesis *h*
together with *everything else she knows* about *e*. So
the agent’s present belief function may capture appropriate
public likelihoods for *e* only if *e* is completely
isolated from the agent’s other beliefs. And this will rarely be the
case.

One Bayesian subjectivist response to this kind of problem is that the
*belief functions* employed in scientific inductive inferences
should often be “counterfactual” belief functions, which
represent what the agent *would believe* if *e* were
subtracted (in some suitable way) from everything else she knows (see,
e.g., Howson & Urbach 1993). However, our examples show that merely subtracting
*e* won’t do. One must also subtract any disjunctive
statements containing *e*. And it can be shown that one must
subtract any uncertain evidence for or against *e* as well. So
the counterfactual belief function idea needs a lot of working out if
it is to rescue the idea that *subjectivist Bayesian belief
functions* can provide a viable account of the likelihoods
employed by the sciences in inductive inferences.

13. That is, for each inductive support function \(P_{\alpha}\), the posterior \(P_{\alpha}[h_j \pmid b\cdot c^n\cdot e^n]\) must go to 0 as the ratio

\[\frac{P_{\alpha}[h_j \pmid b\cdot c^n\cdot e^n]}{P_{\alpha}[h_i \pmid b\cdot c^n\cdot e^n]}\]

goes to 0; and that must occur if the likelihood ratios

\[\frac{P[e^n \pmid h_{j}\cdot b\cdot c^{n}]}{P[e^n \pmid h_{i}\cdot b\cdot c^{n}]}\]

approach 0, provided that the prior probability \(P_{\alpha}[h_i
\pmid b]\) is greater than 0. The Likelihood Ratio Convergence Theorem
will show that when \(h_i\cdot b\) is true, it is very likely that the
evidence will indeed be such as to drive the likelihood ratios as near
to 0 as you please, for a long enough (or strong enough) evidence
stream. (If the stream is *strong* in that the likelihood
ratios of individual bits of evidence are small, then to bring about a
very small cumulative likelihood ratio, the evidence stream need not
be as long.) As likelihood ratios head towards 0, the only way a
Bayesian agent can avoid having her inductive support function(s)
yield posterior probabilities for \(h_j\) that approach 0 (as *n*
gets large) is to continually change her prior probability
assessments. That means either continually finding and adding new
plausibility arguments (i.e., adding to or modifying *b*) that on
balance favor \(h_j\) over \(h_i\), or continually reassessing the
support strength due to plausibility arguments already available, or
both.
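
A small Python sketch may help to show how quickly a shrinking likelihood ratio can swamp even a very lopsided prior. It rests on simplifying assumptions that are not part of the note: \(h_i\) and \(h_j\) are taken to be the only live hypotheses, the experimental conditions \(c^n\) are taken to be equally likely on both, and the priors and likelihood ratios are hypothetical numbers.

```python
from fractions import Fraction


def posterior_of_rival(prior_i, prior_j, likelihood_ratio):
    """Posterior of h_j when h_i and h_j are the only live hypotheses.

    likelihood_ratio = P[e^n | h_j.b.c^n] / P[e^n | h_i.b.c^n].
    """
    ratio = likelihood_ratio * prior_j / prior_i  # posterior ratio of h_j to h_i
    return ratio / (1 + ratio)


# Hypothetical priors that heavily favor the rival h_j.
prior_i, prior_j = Fraction(1, 100), Fraction(99, 100)
for n in (0, 10, 20):
    # Suppose each observation halves the likelihood ratio.
    lr = Fraction(1, 2 ** n)
    print(n, float(posterior_of_rival(prior_i, prior_j, lr)))
```

Holding the priors fixed, the posterior of \(h_j\) falls from .99 toward 0 as the likelihood ratio shrinks; keeping it up would require revising the prior ratio at least as fast as the likelihood ratio falls.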

Technically, continual reassessments of support strengths that favor
\(h_j\) over \(h_i\) based on already extant arguments (in *b*)
means switching to new support functions (or new *vagueness
sets* of them) that assign \(h_j\) ever higher prior probabilities
as compared to \(h_i\) based on the same arguments in *b*. In any
case, such revisions of argument strengths may avoid the convergence
towards 0 of the posterior probability of \(h_j\) only if it proceeds
at a rate that keeps ahead of the rate at which the evidence drives
the likelihood ratios towards 0.

For a thorough presentation of the most prominent Bayesian convergence results and a discussion of their weaknesses see Earman 1992: Ch. 6. However, Earman does not discuss the convergence theorems under consideration here (due to the fact that the convergence results discussed here first appeared in Hawthorne 1993, just after Earman’s book came out).

14.
In scientific contexts all of the most important kinds of cases where
large components of the evidence fail to be
*result-independent* of one another are cases where some part
of the total evidence helps to tie down the numerical value of a
parameter that plays an important role in the likelihood values the
hypothesis specifies for other large parts of the total evidence. In
cases where this only happens rather *locally*, where the
evidence for a parameter value influences the likelihoods of only a
very small part of the total evidence that bears on the hypothesis, we
can treat the conjunction of *the evidence for the parameter
value* with *the evidential outcomes whose likelihood the
parameter value influences* as a single *chunk of
evidence*, which is then *result-independent* of the rest
of the evidence (on each alternative hypothesis). This is the sort of
*chunking of the evidence* into *result-independent
parts* suggested in the main text.

However, in cases where the value of a parameter left unspecified by
the hypothesis has a wide-ranging influence on many of the likelihood
values the hypothesis specifies, another strategy for obtaining
*result-independence* among these components of the evidence
will do the job. A hypothesis that has an unspecified parameter value
is in effect equivalent to a *disjunction of more specific
hypotheses*, where each disjunct consists of a more precise
version of the original hypothesis, a version in which the value for
the parameter has been “filled in”. Relative to each of
these more precise hypotheses, any evidence for or against the
parameter value that hypothesis specifies is evidence for or against
that more precise hypothesis itself. Furthermore, the evidence whose
likelihood values depend on the parameter value (and because of that,
failed to be *result-independent* of the parameter value
evidence relative to the original hypothesis) is
*result-independent* of the parameter value evidence relative
to each of these more precise hypotheses—because each of the
precise hypotheses already identifies precisely what (it claims) the
value of the parameter is. Thus, wherever the workings of the logic of
evidential support are made more perspicuous by treating evidence as
composed of *result-independent chunks*, one may treat
hypotheses whose unspecified parameter values interfere with
*result-independence* as *disjunctively composite
hypotheses*, and apply the evidential logic to these more specific
disjuncts, and thereby regain *result-independence*.

15. Technically, suppose that \(O_k\) can be further “subdivided” into more outcome-descriptions by replacing \(o_{kv}\) with two mutually exclusive parts, \(o_{kv}^*\) and \(o_{kv}^\#\), to produce a new outcome space

\[ O_{k}^{\$} = \{o_{k1},\ldots, o_{kv}^*, o_{kv}^\#,\ldots , o_{kw}\},\]

where

\[P[o_{kv}^*\cdot o_{kv}^\# \pmid h_{i}\cdot b\cdot c_{k}] = 0\]

and

\[P[o_{kv}^{*} \pmid h_{i}\cdot b\cdot c_{k}] + P[o_{kv}^\# \pmid h_{i}\cdot b\cdot c_{k}] = P[o_{kv} \pmid h_{i}\cdot b\cdot c_{k}];\]

and suppose similar relationships hold for \(h_j\). Then the new \(\EQI^*\)
(based on \(O_{k}^{\$}\)) is greater than or equal to EQI (based on
\(O_k\)); and \(\EQI^* \gt \EQI\) *just in case* at least one
of the new likelihood ratios, e.g.,

\[\frac{P[o_{kv}^{*} \pmid h_{i}\cdot b\cdot c_{k}]}{P[o_{kv}^{*} \pmid h_{j}\cdot b\cdot c_{k}]},\]

differs in value from the “undivided” outcome’s likelihood ratio,

\[\frac{P[o_{kv} \pmid h_{i}\cdot b\cdot c_{k}]}{P[o_{kv} \pmid h_{j}\cdot b\cdot c_{k}]}.\]

The supplement on the effect on EQI of partitioning the outcome space proves this claim.
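
The claim can be illustrated numerically. The Python sketch below assumes that EQI is the expectation, under \(h_i\), of the base-2 logarithm of the likelihood ratio of \(h_i\) over \(h_j\); the outcome probabilities are hypothetical, and the function name `eqi` is introduced only for this example.

```python
from math import isclose, log2


def eqi(p_i, p_j):
    """Expected (under h_i) base-2 log of P[o | h_i] / P[o | h_j]."""
    return sum(a * log2(a / b) for a, b in zip(p_i, p_j) if a > 0)


# Hypothetical two-outcome experiment: likelihoods on h_i and on h_j.
base = eqi([0.5, 0.5], [0.25, 0.75])

# Split the second outcome so that the two parts have different likelihood ratios.
split_different = eqi([0.5, 0.4, 0.1], [0.25, 0.3, 0.45])

# Split it so that both parts keep the undivided likelihood ratio of 2/3.
split_same = eqi([0.5, 0.2, 0.3], [0.25, 0.3, 0.45])

print(base, split_different, split_same)
```

In this example the first subdivision raises EQI because the new likelihood ratios differ from the undivided one, while the second leaves it unchanged (up to rounding) because they do not.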

16. The likely rate of convergence will almost always be much faster than the worst case bound provided by Theorem 2. To see the point more clearly, let’s look at a very simple example. Suppose \(h_i\) says that a certain bent coin has a propensity for “heads” of 2/3 and \(h_j\) says the propensity is 1/3. Let the evidence stream consist of outcomes of tosses. In this case the average EQI equals the EQI of each toss, which is 1/3; and the smallest possible likelihood ratio occurs for “heads”, which yields the value \(\gamma = \frac{1}{2}\). So, the value of the lower bound given by Theorem 2 for the likelihood of getting an outcome sequence with a likelihood ratio below \(\varepsilon\) (for \(h_j\) over \(h_i)\) is

\[ 1 -\frac{ (1/n)(\log \tfrac{1}{2})^2}{((1/3) + (\log \varepsilon)/n)^2} = 1 - \frac{9}{n\times(1 + 3(\log \varepsilon)/n)^2}. \]
Thus, according to the theorem, the likelihood of getting an outcome
sequence with a likelihood ratio less than \(\varepsilon = 1/16\)
(=.06) when \(h_i\) is true and the number of tosses is \(n = 52\) is
*at least* .70; and for \(n = 204\) tosses the likelihood is
*at least* .95.
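
These figures can be reproduced with a short Python sketch of the bound as displayed above. Base-2 logarithms are assumed (that is what makes the per-toss EQI come out to 1/3), and the function name and default arguments are choices made for this example, not part of Theorem 2 itself.

```python
from math import log2


def theorem2_lower_bound(n, eps, avg_eqi=1 / 3, gamma=0.5):
    """1 - (1/n)(log gamma)^2 / (avg_eqi + (log eps)/n)^2, with base-2 logs."""
    margin = avg_eqi + log2(eps) / n
    if margin <= 0:
        # The bound is informative only when avg_eqi exceeds -(log eps)/n.
        raise ValueError("n is too small for the bound to apply")
    return 1 - (log2(gamma) ** 2 / n) / margin ** 2


for n in (52, 204):
    print(n, round(theorem2_lower_bound(n, 1 / 16), 4))
```

For \(\varepsilon = 1/16\) this gives a bound just above .70 at \(n = 52\) and just above .95 at \(n = 204\).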

To see the amount by which the lower bound provided by the theorem is
in fact *overly cautious*, consider what the usual binomial
distribution for the coin tosses in this example implies about the
likely values of the likelihood ratios. The likelihood ratio for
exactly *k* “heads” in *n* tosses is

\[\frac{(1/3)^{k}(2/3)^{n-k}}{(2/3)^{k}(1/3)^{n-k}} = 2^{n-2k},\]

and we want this likelihood ratio to have a value less than \(\varepsilon\). A bit of algebraic manipulation shows that to get this likelihood ratio value to be below \(\varepsilon\), the proportion of “heads” needs to be \(k/n \gt \frac{1}{2} - \frac{1}{2}(\log \varepsilon)/n\). Using the normal approximation to the binomial distribution (with mean \(= 2/3\) and variance \(= (2/3)(1/3)/n\)) the actual likelihood of obtaining an outcome sequence having a proportion of “heads” greater than \(\frac{1}{2} - \frac{1}{2}(\log \varepsilon)/n\) (which we just saw corresponds to getting a likelihood ratio less than \(\varepsilon\), thus disfavoring the 1/3 propensity hypothesis as compared to the 2/3 propensity hypothesis by that much) when the true propensity for “heads” is 2/3 is given by the formula

\[ \begin{align} \Phi\left[\frac{\left(\textrm{mean} - \left(\frac{1}{2} - \frac{1}{2}\frac{(\log \varepsilon)}{n}\right)\right)}{(\textrm{variance})^{\frac{1}{2}}}\right] = \Phi\left[(1/8)^{\frac{1}{2}}n^{\frac{1}{2}}\left(1 + \frac{3(\log \varepsilon)}{n}\right)\right] \end{align} \]
(where \(\Phi[x]\) gives the value of the standard normal distribution
from \(-\infty\) to *x*). Now let \(\varepsilon = 1/16\) (=
.0625), as before. So the actual likelihood of obtaining a stream of
outcomes with likelihood ratio this small when \(h_i\) is true and the
number of tosses is \(n = 52\) is \(\Phi[1.96] \gt .975\), whereas the
lower bound given by
Theorem 2
was .70. And if the number of tosses is increased to \(n = 204\), the
likelihood of obtaining an outcome sequence with a likelihood ratio
this small (i.e., \(\varepsilon = 1/16)\) is \(\Phi[4.75] \gt
.999999\), whereas the lower bound from Theorem 2 for this likelihood
is .95. Indeed, to actually get a likelihood of .95 that the evidence
stream will produce a likelihood ratio less than \(\varepsilon =
1/16\), the number of tosses needed is only \(n = 43\), rather than
the 204 tosses the bound given by the theorem requires in order to get
up to the value .95. (Note: These examples employ “identically
distributed” trials—repeated tosses of a coin—as an
illustration. But Convergence Theorem 2 applies much more generally.
It applies to any evidence sequence, no matter how diverse the
probability distributions for the various experiments or observations
in the sequence.)
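
The comparison in this note can be reproduced numerically. The sketch below is an illustration under stated assumptions, not part of the theorem: it takes logarithms to base 2, implements \(\Phi\) with the error function, and adds an exact binomial calculation as a cross-check on the normal approximation. All function names are hypothetical.

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def normal_approx(n, eps):
    """Normal approximation to the chance that the likelihood ratio
    (h_j over h_i) falls below eps when h_i (propensity 2/3) is true."""
    z = math.sqrt(n / 8) * (1 + 3 * math.log2(eps) / n)
    return phi(z)

def exact_binomial(n, eps, p=2 / 3):
    """Exact chance of k heads with 2**(n - 2k) < eps, for k ~ Binomial(n, p)."""
    threshold = (n - math.log2(eps)) / 2      # the ratio is below eps when k > threshold
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1) if k > threshold)

def smallest_n(target, eps):
    """Smallest n at which the normal approximation reaches the target probability."""
    n = 1
    while normal_approx(n, eps) < target:
        n += 1
    return n

print(normal_approx(52, 1 / 16), exact_binomial(52, 1 / 16))
print(smallest_n(0.95, 1 / 16))   # 43
```

The search in `smallest_n` relies on the approximation increasing with \(n\), which holds here because the argument of \(\Phi\) equals \((n + 3\log \varepsilon)/(8n)^{\frac{1}{2}}\).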

17. It should now be clear why the boundedness of EQI above 0 is important. Convergence Theorem 2 applies only when

\[\bEQI[c^{n} \pmid h_i /h_j \pmid b] \gt \frac{-(\log \varepsilon)}{n}.\]
But this requirement is not a strong assumption. For, the
**Nonnegativity of EQI Theorem** shows that the empirical
distinctness of two hypotheses on a single possible outcome
*suffices* to make the average EQI positive for the whole
sequence of experiments. So, given any small fraction \(\varepsilon
\gt 0\), the value of \(-(\log \varepsilon)/n\) (which is always
greater than 0 when \(\varepsilon \lt 1)\) will eventually become
smaller than \(\bEQI\), provided that the degree to which the
hypotheses are empirically distinct for the various observations \(c_k\)
does not on average degrade too much as the length *n* of the
evidence stream increases.
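
As a rough illustration of this applicability condition, the following Python sketch finds the first \(n\) at which the average EQI exceeds \(-(\log \varepsilon)/n\). It assumes base-2 logarithms and an average EQI that stays constant as \(n\) grows (as in the coin example of note 16); the function name `first_applicable_n` is hypothetical.

```python
import math

def first_applicable_n(avg_eqi, eps):
    """Smallest n for which avg_eqi > -log2(eps)/n, assuming avg_eqi is constant in n."""
    if avg_eqi <= 0 or not 0 < eps < 1:
        raise ValueError("requires avg_eqi > 0 and 0 < eps < 1")
    n = 1
    while avg_eqi <= -math.log2(eps) / n:
        n += 1
    return n

print(first_applicable_n(1 / 3, 1 / 16))  # 13
```

For the bent coin, with average EQI 1/3 and \(\varepsilon = 1/16\), the condition first holds at \(n = 13\), since \(-(\log \varepsilon)/n = 4/n\) drops below 1/3 only once \(n\) exceeds 12.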

When the possible outcomes for the sequence of observations are
independent and identically distributed, Theorems 1 and 2 effectively
reduce to L. J. Savage’s Bayesian Convergence Theorem [Savage,
pg. 52–54], although Savage’s theorem doesn’t supply
explicit lower bounds on the probability that the likelihood ratio
will be small. Independent, identically distributed outcomes most
commonly result from the repetition of identical statistical
experiments (e.g., repeated tosses of a coin, or repeated measurements
of quantum systems prepared in identical states). In such experiments
a hypothesis will specify the same likelihoods for the same kinds of
outcomes from one observation to the next. So \(\bEQI\) will remain
constant as the number of experiments, *n*, increases. However,
Theorems 1 and 2 are much more general. They continue to hold when the
sequence of observations encompasses completely unrelated experiments
that have different distributions on outcomes—experiments that
have nothing in common but their connection to the hypotheses they
test.

18.
In many scientific contexts this is the best we can hope for. But it
still provides a very reasonable representation of inductive support.
Consider, for example, the hypothesis that the land masses of Africa
and South America separated and drifted apart over the eons, the
*drift hypothesis*, as opposed to the hypothesis that the
continents have fixed positions acquired when the earth first formed
and cooled and contracted, the *contraction hypothesis*. One
may not be able to determine anything like precise likelihoods, on
each hypothesis, for the evidence that: (1) the shape of the east
coast of South America matches the shape of the west coast of Africa
as closely as it in fact does; (2) the geology of the two coasts matches
up so closely when they are “fitted together” in the
obvious way; (3) the plant and animal species on these distant
continents are as similar as they are, as compared to how
similar species are among other distant continents. Although neither
the *drift hypothesis* nor the *contraction hypothesis*
supplies anything like precise likelihoods for these evidential
claims, experts readily agree that each of these observations is
*much more likely* on the *drift hypothesis* than on the
*contraction hypothesis*. That is, the likelihood ratio for
this evidence on the *contraction hypothesis* as compared to
the *drift hypothesis* is very small. Thus, jointly these
observations constitute very strong evidence for *drift* over
*contraction*.

Historically, the case of continental drift is more complicated.
Geologists tended to largely dismiss this evidence until the 1960s.
This was not because the evidence wasn’t strong in its own
right. Rather, this evidence was found unconvincing because it was not
sufficient to overcome prior plausibility considerations that made the
*drift hypothesis* extremely implausible—much less
plausible than the *contraction hypothesis*. The problem was
that there seemed to be no plausible mechanism by which *drift*
might occur. It was argued, quite plausibly, that no known force could
push or pull the continents apart, and that the less dense continental
material could not push through the denser material that makes up the
ocean floor. These plausibility objections were overcome when a
plausible mechanism was articulated—i.e., the continental crust
floats atop molten material and moves apart as convection currents in
the molten material carry it along. The case was pretty well clinched
when evidence for this mechanism was found in the form of
“spreading zones” containing alternating strips of
magnetized material at regular distances from mid-ocean ridges. The
magnetic alignments of materials in these strips correspond closely to
the magnetic alignments found in magnetic materials in dateable
sedimentary layers at other locations on the earth. These magnetic
alignments indicate time periods when the direction of earth’s
magnetic field has reversed. And this gave geologists a way of
measuring the rate at which the sea floor might spread and the
continents move apart. Although geologists may not be able to
determine anything like precise values for the likelihoods of any of
this evidence on each of the alternative hypotheses, the evidence is
universally agreed to be *much more likely* on the *drift
hypothesis* than on the alternative *contraction
hypothesis*. The *likelihood ratio* for this evidence on
the *contraction hypothesis* as compared to the *drift
hypothesis* is somewhat vague, but extremely small. The vagueness
is only in regard to how extremely small the likelihood ratio is.
Furthermore, with the emergence of a plausible mechanism, the
*drift hypothesis* is no longer so overwhelmingly implausible
*(logically) prior* to taking the likelihood evidence into
account. Thus, even when precise values for individual likelihoods are
not available, the value of *a likelihood ratio range* may be
*objective enough* to strongly refute one hypothesis as
compared to another. Indeed, the *drift hypothesis* is itself
strongly supported by the evidence; for, no alternative hypothesis
that has the slightest amount of comparative plausibility can account
for the available evidence nearly so well. That is, no plausible
alternative makes the evidence anywhere near so likely. Given the
currently available evidence, the only issues left open (for now)
involve comparing various alternative versions of the drift hypothesis
(involving differences of detail) against one another.

19. To see the point of the third clause, suppose it were violated. That is, suppose there are possible outcomes for which the likelihood ratio is very near 1 for just one of the two support functions. Then, even a very long sequence of such outcomes might leave the likelihood ratio for one support function almost equal to 1, while the likelihood ratio for the other support function goes to an extreme value. If that can happen for support functions in a class that represent likelihoods for various scientists in the community, then the empirical contents of the hypotheses are either too vague or too much in dispute for meaningful empirical evaluation to occur.

20.
If there are a few directionally controversial likelihood ratios,
where \(P_{\alpha}\) says the ratio is somewhat greater than 1, while
\(P_{\beta}\) assigns a value somewhat less than 1, these may not
greatly affect the trend of \(P_{\alpha}\) and \(P_{\beta}\) towards
agreement on the refutation and support of hypotheses *provided
that* the controversial ratios are not so extreme as to overwhelm
the stream of other evidence on which the likelihood ratios do
directionally agree. Even so, researchers will want to get straight on
what each hypothesis *says* or *implies* about such
cases. While that remains in dispute, the empirical content of each
hypothesis remains unsettlingly vague.

## Notes to Supplement on Enumerative Inductions: Bayesian Estimation and Convergence

21.
What it means for a sample to be *randomly selected* from a
population is philosophically controversial. Various analyses of the
concept have been proposed, and disputed. For our purposes an account
of the following sort will suffice. To say

*S* is a random sample of population *B* with respect to
attribute *A*

means that

the selection set *S* is generated by a process that has an
objective chance (or propensity) *r* of choosing individual
objects that have attribute *A* from among the objects in
population *B*, where on each selection the chance value *r*
agrees with the value *r* of the frequency of *A*s among the
*B*s, \(F[A,B]\).

Defined this way, randomness implies probabilistic independence among
the outcomes of selections with regard to whether they exhibit
attribute *A*, on any given hypothesis about the true value of
the frequency *r* of *A*s among the *B*s.

The tricky part of generating a randomly selected set from the
population is to find a selection process for which the chance of
selecting an *A* each time matches the true frequency without
already knowing what the true frequency value is—i.e., without
already knowing what the value of *r* is. However, there clearly
are ways to do this. Here is one way:

the sample *S* is generated by a process that on each selection
gives each member of *B* an equal chance of being selected into
*S* (like drawing balls from a well-shaken urn).

Here, schematically, is another way:

find a subclass of *B*, call it *C*, from which *S* can
be generated by a process that gives every member of *C* an equal
chance of being selected into *S*, where *C* is
*representative* of *B* with respect to *A* in the
sense that the frequency of *A* in *C* is almost precisely
the same as the frequency of *A* in *B*.

Pollsters use a process of this kind. Ideally a poll of registered
voters, population *B*, should select a sample *S* in a way
that gives every registered voter the same chance of getting selected
into *S*. But that may be impractical. However, it suffices if
the sample is selected from a representative subpopulation *C* of
*B*—e.g., from registered voters who answered the telephone
between the hours of 7 PM and 9 PM in the middle of the week. Of
course, the claim that a given subpopulation *C* is
*representative* is itself a hypothesis that is open to
inductive support by evidence. Professional polling organizations do a
lot of research to calibrate their sampling technique, to find out
what sort of subpopulations *C* they may draw on as highly
representative. For example, one way to see if registered voters who
answer the phone during the evening, mid-week, are likely to
constitute a representative sample is to conduct a large poll of such
voters immediately after an election, when the result is known, to see
how representative of the actual vote count the count from the
subpopulation turns out to be.

Notice that although the selection set *S* is *selected from
B*, *S* cannot be a subset of *B*, not if *S* can
be generated by *sampling with replacement*. For, a specific
member of *B* may be randomly selected into *S* more than
once. If *S* were a subset of *B*, any specific member of
*B* could only occur once in *S*. That is, consider the case
where *S* consists of *n* selections from *B*, but
where the process happens to select the same member *b* of
*B* twice. Then, were *S* a subset of *B*, although
*b* is selected into *S* twice, *S* can only possess
*b* as a member once, so *S* has at most \(n-1\) members
after all (even fewer if other members of *B* are selected more
than once). So, rather than being members of *B*, the members of
*S* must be *representations of members of B*, like names,
where the same member of *B* may be represented by different
names. However, the representations (or names) in *S* technically
may not be the sorts of things that can possess attribute *A*.
So, technically, on this way of handling the problem, when we say that
a member of *S exhibits A*, this is shorthand for *the
referent in B of that member of S possesses attribute A*.
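
The point about sampling with replacement can be made concrete with a small simulation. This is only a sketch: the population of 20 objects, the attribute flag `has_A`, and the helper `sample_with_replacement` are all hypothetical, and the sample is stored as a list of indices that play the role of the "names" described above.

```python
import random

def sample_with_replacement(names, n, seed=0):
    """Draw n selections, each giving every name an equal chance of being chosen."""
    rng = random.Random(seed)
    return rng.choices(names, k=n)   # a list, so repeated selections are kept

# Hypothetical population B of 20 objects, 8 of which have attribute A.
B = [{"name": f"b{i}", "has_A": i < 8} for i in range(20)]

# S holds indices into B (representations of members), not the members themselves.
S = sample_with_replacement(list(range(len(B))), 30)

distinct_members = set(S)            # collapsing S to a set loses the repeats
sample_frequency = sum(B[i]["has_A"] for i in S) / len(S)

print(len(S), len(distinct_members), sample_frequency)
```

Because 30 selections are drawn from only 20 objects, the set of distinct members is necessarily smaller than the sample, which is why \(\Size[S] = n\) has to count selections rather than distinct members of *B*.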

22.
This is closely analogous to the Stable-Estimation Theorem of
Edwards, Lindman, & Savage (1963). Here is a proof of Case 1,
i.e., where the number of members of the reference class *B* is
finite and where for some integer *u* at least as large as the
size of *B* there is a specific (perhaps very large) integer
*K* such that the prior probability of a hypothesis stating a
frequency outside region *R* is never more than *K* times as
large as that of a hypothesis stating a frequency within region *R*. (The
proof of Case 2 is almost exactly the same, but draws on integrals
wherever the present proof draws on sums using the
‘\(\sum\)’ expression.)

A few observations before proceeding to the main derivation:

- The hypotheses under consideration consist of all expressions of form \(F[A,B] = k/u\), where *u* is as described above and *k* is a non-negative integer between 0 and *u*.
- *R* is some set of fractions of form \(k/u\) for a contiguous sequence of non-negative integers *k* that includes the sample frequency \(m/n\).
- In the following derivation all sums over values *r* in *R* are abbreviations for sums over integers *k* such that \(k/u\) is in *R*; similarly, all sums over values *s* not in *R* are abbreviations for sums over integers *k* such that \(k/u\) is not in *R*. The sum over \(\{s \pmid s=k/u\}\) represents the sum over all integers *k* from 0 through *u*.
- Define *L* to be the smallest value of a prior probability \(P_{\alpha}[F[A,B]=r \pmid b]\) for *r* a fraction in *R*. Notice that \(L \gt 0\) because, by supposition, finite \[K \ge \frac{P_{\alpha}[F[A,B]=s \pmid b] }{ P_{\alpha}[F[A,B]=r \pmid b]}\] for the largest value of \(P_{\alpha}[F[A,B]=s \pmid b]\) for which *s* is outside of *R* and the smallest value of \(P_{\alpha}[F[A,B]=r \pmid b]\) for which *r* is inside of region *R*.
- Thus, from the definition of *L* and of *K*, it follows that: \(K \ge P_{\alpha}[F[A,B]=s \pmid b] / L\) for each value of \(P_{\alpha}[F[A,B]=s \pmid b]\) for which *s* is outside of *R*; and \(1 \le P_{\alpha}[F[A,B]=r \pmid b] / L\) for each value of \(P_{\alpha}[F[A,B]=r \pmid b]\) for which *r* is inside of *R*.
- It follows that: \[ \begin{multline} \sum_{s\not\in R} s^m\times(1-s)^{n-m}\times\left(\frac{P_{\alpha}[F[A,B]=s \pmid b]}{L}\right)\\ \le \sum_{s\not\in R} s^m \times(1-s)^{n-m}\times K \end{multline} \] and \[ \begin{multline} \sum_{r\in R} r^m \times(1-r)^{n-m} \times(P_{\alpha}[F[A,B]=r \pmid b] / L) \\ \ge \sum_{r\in R} r^m \times(1-r)^{n-m} . \end{multline} \]
- For \(\beta[R, m+1, n-m+1]\) defined as \[ \frac{\int_{R} r^{m} (1-r)^{n-m} dr}{\int_{0}^1 r ^m (1-r)^{n-m} dr}, \] when *u* is large, it is an established mathematical fact that \[ \frac{\sum_{r\in R} r^m\times(1-r)^{n-m}} {\sum_{s\in \{s \pmid s=k/u\}} s^m\times(1-s)^{n-m} } \] is extremely close to the value of \(\beta[R, m+1, n-m+1]\).

We now proceed to the main part of the derivation.

From the Odds Form of Bayes’ Theorem (Equation 10) we have,

\( \begin{align} &\Omega_{\alpha}\left[\begin{aligned} &F[A,B]\not\in R \\ & \begin{split}{} \pmid F[A,S] =m/n & \cdot \Rnd[S,B,A] \\ & \cdot \Size[S] =n \\ & \cdot b \end{split} \end{aligned}\right] \\[1ex] \end{align}\)

\(\begin{align} & = \frac{\sum_{s\not\in R} P_{\alpha}\left[\begin{aligned} & F[A,B]=s \\ &\begin{split} {} \pmid F[A,S] =m/n & \cdot \Rnd[S,B,A] \\ & \cdot \Size[S] =n \\ & \cdot b \end{split} \end{aligned}\right]} {\sum_{r\in R} P_{\alpha}\left[\begin{aligned} & F[A,B] =r \\ &\begin{split} {} \pmid F[A,S] =m/n & \cdot \Rnd[S,B,A] \\ & \cdot \Size[S]=n \\ &\cdot b \end{split} \end{aligned}\right]}\\[1ex] \end{align}\)

\(\begin{align} & = \frac{\sum_{s\not\in R} P\left[\begin{aligned} & F[A,S]=m/n \\ &\begin{split} {} \pmid F[A,B] =s & \cdot \Rnd[S,B,A] \\ & \cdot \Size[S]=n \\ & \cdot b \end{split} \end{aligned}\right] \times P_{\alpha}[F[A,B] =s \pmid b]} {\sum_{r\in R} P\left[\begin{aligned} & F[A,S] =m/n \\ & \begin{split} {} \pmid F[A,B] =r & \cdot \Rnd[S,B,A] \\ & \cdot \Size[S] =n \\ & \cdot b \end{split} \end{aligned}\right] \times P_{\alpha}[F[A,B] =r \pmid b]}\\[1ex] \end{align}\)

\(\begin{align} & = \frac{\sum_{s\not\in R} s^m \times(1-s)^{n-m} \times P_{\alpha}[F[A,B]=s \pmid b]} {\sum_{r\in R} r^m\times(1-r)^{n-m} \times P_{\alpha}[F[A,B]=r \pmid b]}\\[1ex] \end{align}\)

\(\begin{align} & = \frac{\sum_{s\not\in R} s^m\times(1-s)^{n-m} \times(P_{\alpha}[F[A,B]=s \pmid b] / L)} {\sum_{r\in R} r^m\times(1-r)^{n-m} \times(P_{\alpha}[F[A,B]=r \pmid b] / L)}\\[1ex] \end{align}\)

\(\begin{align} & \le \frac{\sum_{s\not\in R} s^m\times(1-s)^{n-m} \times K} {\sum_{r\in R} r^m\times(1-r)^{n-m}}\\[1ex] \end{align}\)

\(\begin{align} &= K \times \frac{\sum_{s\in \{s \pmid s=k/u\}} s^m\times(1-s)^{n-m} - \sum_{r\in R} r^m\times(1-r)^{n-m}} {\sum_{r\in R} r^m\times(1-r)^{n-m}}\\[1ex] \end{align}\)

\(\begin{align} &= K \times \left[\frac{\sum_{s\in \{s \pmid s=k/u\}} s^m\times(1-s)^{n-m}} {\sum_{r\in R} r^m\times(1-r)^{n-m}} - 1\right]\\[1ex] \end{align}\)

\(\begin{align} &\approx K\times[(1/\beta[R, m+1, n-m+1]) - 1]. \end{align} \)

Thus,

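\[ \begin{multline} \Omega_{\alpha}\left[\begin{aligned} & F[A,B]\not\in R \\ &\begin{split} \pmid F[A,S] =m/n & \cdot \Rnd[S,B,A] \\ & \cdot \Size[S]=n \\ & \cdot b \end{split} \end{aligned}\right]\\[1ex] \le K\times\left[\left( \frac{1}{\beta[R, m+1, n-m+1]}\right) - 1\right]. \end{multline} \]
Then by equation (11), which expresses
the relationship between *posterior probability* and
*posterior odds against*,

\[ \begin{multline} P_{\alpha}\left[\begin{aligned} & F[A,B]\in R \\ &\begin{split} \pmid F[A,S] =m/n & \cdot \Rnd[S,B,A] \\ & \cdot \Size[S]=n \\ & \cdot b \end{split} \end{aligned}\right]\\[1ex] \ge \frac{1}{1 + K\times\left[\left( \frac{1}{\beta[R, m+1, n-m+1]}\right) - 1\right]}. \end{multline} \]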
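
The exact part of this derivation (everything before the final approximation by \(\beta\)) can be checked numerically. The following Python sketch does so under assumptions chosen purely for illustration: \(u = 1000\), a sample with \(m = 60\) and \(n = 100\), the region \(R = [.5, .7]\), and a hypothetical non-uniform prior proportional to \(1 + 4s\). None of these values come from the text.

```python
def odds_bound_check(u=1000, n=100, m=60, lo=0.5, hi=0.7):
    """Return (posterior odds against R, the bound K*(sum_all/sum_R - 1), implied
    lower bound on the posterior probability of R) for an illustrative prior."""
    grid = [k / u for k in range(u + 1)]              # hypotheses F[A,B] = k/u
    in_R = [lo <= s <= hi for s in grid]

    # Hypothetical non-uniform prior over the hypotheses (an assumption).
    raw = [1 + 4 * s for s in grid]
    total = sum(raw)
    prior = [w / total for w in raw]

    # Likelihood factor s^m (1-s)^(n-m) for each hypothesis.
    like = [s ** m * (1 - s) ** (n - m) for s in grid]

    L = min(p for p, r in zip(prior, in_R) if r)          # smallest prior inside R
    K = max(p for p, r in zip(prior, in_R) if not r) / L  # largest outside over smallest inside

    num = sum(l * p for l, p, r in zip(like, prior, in_R) if not r)
    den = sum(l * p for l, p, r in zip(like, prior, in_R) if r)
    odds_against = num / den

    sum_R = sum(l for l, r in zip(like, in_R) if r)
    bound = K * (sum(like) / sum_R - 1)
    return odds_against, bound, 1 / (1 + bound)

odds, bound, prob_lower = odds_bound_check()
print(odds, bound, prob_lower)
```

Under these assumptions the posterior odds against \(R\) stay below the bound, and the last returned value is the corresponding lower bound on the posterior probability that the population frequency lies in \(R\).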

23.
To get a better idea of the import of this theorem, let’s
consider some specific values. First notice that the factor
\(r\times(1-r)\) can never be larger than \((1/2)\times(1/2) =
1/4\); and the closer *r* is to 1 or 0, the smaller
\(r\times(1-r)\) becomes. So, whatever the value of *r*, the
factor \(q/((r\times(1-r)/n)^{\frac{1}{2}}) \ge 2\times q\times
n^{\frac{1}{2}}\). Thus, for any chosen value of *q*,

For example, if \(q =\) .05 and \(n = 400\), then we have (for any
value of *r*),

For \(n = 900\) (and margin \(q =\) .05) this lower bound rises to .997:

\[ \begin{multline} P\left[\begin{aligned} & r-.05 \lt F[A,S] \lt r+.05 \\ &\begin{split} {} \pmid F[A,B] = r & \cdot\Rnd[S,B,A]\\ & \cdot\Size[S] = 900 \end{split} \end{aligned}\right] \ge .997. \end{multline} \]
If we are interested in a smaller margin of error *q*, we can
keep the same sample size and find the value of the lower bound for
that value of *q*. For example,

By increasing the sample size the bound on the likelihood can be made
as close to 1 as we want, for any margin *q* we choose. For
example:

As the sample size *n* becomes larger, it becomes extremely
likely that the sample frequency will come to within any specified
region close to the true frequency *r*, as close as you wish.
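
Lower bounds of this kind can be explored with a few lines of Python. The sketch below assumes that the bound takes the normal-approximation form \(1 - 2\Phi[-2\times q\times n^{\frac{1}{2}}]\); that form is an assumption made here, chosen because it is consistent with the .997 figure given for \(n = 900\) and \(q = .05\). The function names are hypothetical.

```python
import math

def within_margin_lower_bound(q, n):
    """Assumed bound 1 - 2*Phi(-2*q*sqrt(n)), which holds for any value of r
    because r*(1-r) is at most 1/4."""
    x = 2 * q * math.sqrt(n)
    return math.erf(x / math.sqrt(2))   # identical to 1 - 2*Phi(-x)

def sample_size_for(q, target):
    """Smallest sample size n whose assumed lower bound reaches the target."""
    n = 1
    while within_margin_lower_bound(q, n) < target:
        n += 1
    return n

for n in (400, 900):
    print(n, within_margin_lower_bound(0.05, n))
print(sample_size_for(0.05, 0.95))   # 385
```

Varying `q` and `n` shows the trade-off described here: a smaller margin lowers the bound at a fixed sample size, and a larger sample raises it again.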

## Notes to Supplement on Likelihood Ratios, Likelihoodism, and the Law of Likelihood

24.
To see the point more clearly, consider an example. To keep things
simple, let’s suppose our background *b* says that the
chance of *heads* for tosses of this coin is some whole
percentage between 0% and 100%. Let *c* say that the coin is
tossed in the usual random way; let *e* say that the coin comes
up heads; and for each *r* between 0 and 1 that is a whole number of
hundredths, let \(h_{[r]}\) be the *simple statistical
hypothesis* asserting that the chance of heads on each random toss
of this coin is *r*. Now consider the *composite statistical
hypothesis* \(h_{[\gt .65]}\), which asserts that the chance of
heads on each random (independent) toss is greater than .65. From the
axioms of probability we derive the following relationship:

The issue for the *likelihoodist* is that the values of the
terms of form \(P_{\alpha}[h_{[r]} \pmid h_{[\gt .65]}\cdot c\cdot
b]\) are not objectively specified by the composite hypothesis
\(h_{[\gt .65]}\) (together with \(c\cdot b)\), but the value of the
likelihood \(P_{\alpha}[e \pmid h_{[\gt .65]}\cdot c\cdot b]\) depends
essentially on these non-objective factors. So, likelihoods based on
composite statistical hypotheses fail to possess the kind of
objectivity that *likelihoodists* require.
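
A small computation illustrates this dependence. It assumes the relationship in question is the law of total probability, \(P_{\alpha}[e \pmid h_{[\gt .65]}\cdot c\cdot b] = \sum_{r \gt .65} r \times P_{\alpha}[h_{[r]} \pmid h_{[\gt .65]}\cdot c\cdot b]\), where each simple hypothesis \(h_{[r]}\) gives heads the likelihood *r*. The two weightings below are hypothetical, as is the function name `composite_likelihood`.

```python
def composite_likelihood(weights):
    """Likelihood of heads on the composite hypothesis, as a weighted average of
    the simple likelihoods r.

    weights maps each r > .65 to the weight given to h_[r] within the composite
    hypothesis; the weights must sum to 1.
    """
    assert abs(sum(weights.values()) - 1) < 1e-9
    return sum(r * w for r, w in weights.items())

rs = [k / 100 for k in range(66, 101)]            # .66, .67, ..., 1.00

# Weighting 1 (hypothetical): every simple hypothesis counts equally.
uniform = {r: 1 / len(rs) for r in rs}

# Weighting 2 (hypothetical): weight grows with r, favoring chances near 1.
raw = {r: (r - 0.65) ** 3 for r in rs}
total = sum(raw.values())
skewed = {r: w / total for r, w in raw.items()}

print(composite_likelihood(uniform), composite_likelihood(skewed))
```

The two weightings give different likelihoods for the very same evidence *e* on the very same composite hypothesis, which is the difficulty for the *likelihoodist* described above.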

25.
The **Law of Likelihood** and the **Likelihood
Principle** have been formulated in slightly different ways by
various logicians and statisticians. The **Law of
Likelihood** was first identified by that name in Hacking 1965,
and has been invoked more recently by the *likelihoodist*
statisticians A.F.W. Edwards (1972) and R. Royall (1997). R.A. Fisher
(1922) argued for the **Likelihood Principle** early in
the 20th century, although he didn’t call it that.
One of the first places it is discussed under that name is Savage et
al. 1962.