Notes to Inductive Logic

1. Although enumerative inductive arguments may seem similar to what classical statisticians call estimation, they are not really the same thing. As classical statisticians are quick to point out, estimation does not use the sample to inductively support a conclusion about the whole population. Estimation is not supposed to be a kind of inductive inference. Rather, estimation is a decision strategy. The sample frequency will be within two standard deviations of the population frequency in about 95% of all samples. So, if one adopts the strategy of accepting as true the claim that the population frequency is within two standard deviations of the sample frequency, and if one uses this strategy repeatedly for various samples, one should be right about 95% of the time. I will discuss enumerative induction in much more detail later in the article.

2. Another way of understanding axiom (5) is to view it as a generalization of the deduction theorem and its converse. The deduction theorem and its converse say this: C ⊨ (B⊃A) if and only if (C·B) ⊨ A. Given axioms (1-4), axiom (5) is equivalent to the following:

5*.    (1 − Pα[(B⊃A) | C])  =  (1 − Pα[A | (B·C)]) × Pα[B | C].

The conditional probability Pα[A | (B·C)] completely discounts the possibility that B is false, whereas the probability of the conditional Pα[(B⊃A) | C] depends significantly on how probable B is (given C), and must approach 1 if Pα[B | C] is near 0. Rule (5*) captures how this difference between the conditional probability and the probability of a conditional works. It says that the distance below 1 of the support strength of C for (B⊃A) equals the product of the distance below 1 of the support strength of (B·C) for A and the support strength of C for B. This makes good sense: the support of C for (B⊃A) (i.e., for (~B∨A)) is closer to 1 than the support of (B·C) for A by the multiplicative factor Pα[B | C], which is small to the degree that C supports ~B. According to Rule (5*), then, for any fixed value of Pα[A | (B·C)] < 1, as Pα[B | C] approaches 0, Pα[(B⊃A) | C] must approach 1.
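Rule (5*) can be checked numerically on a small example. The sketch below is only an illustration: the four weights are hypothetical values chosen for the check, and the background C is treated as fixed, so every probability computed is implicitly conditional on C.

```python
from fractions import Fraction

# Hypothetical weights for the four truth-value combinations of (A, B),
# all relative to a fixed background C. Any positive weights summing to 1 would do.
weights = {
    (True, True): Fraction(1, 10),
    (True, False): Fraction(3, 10),
    (False, True): Fraction(2, 10),
    (False, False): Fraction(4, 10),
}

def prob(event):
    """Probability (given C) of the (A, B) combinations on which event holds."""
    return sum(w for world, w in weights.items() if event(*world))

def cond_prob(event, given):
    """Conditional probability by the ratio definition; 'given' must have positive probability."""
    return prob(lambda a, b: event(a, b) and given(a, b)) / prob(given)

p_b = prob(lambda a, b: b)                                  # P[B | C]
p_a_given_b = cond_prob(lambda a, b: a, lambda a, b: b)     # P[A | (B·C)]
p_conditional = prob(lambda a, b: (not b) or a)             # P[(B⊃A) | C]

lhs = 1 - p_conditional
rhs = (1 - p_a_given_b) * p_b
print(lhs, rhs)  # both sides come out to 1/5 for these weights
```

With these weights the conditional probability of A given B·C is only 1/3, while the probability of the conditional is 4/5, which shows how far apart the two quantities can be when B is not very probable.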

3. This is not what is commonly referred to as countable additivity. Countable additivity requires a language in which infinitely long disjunctions are defined. It would then specify that Pα[((B1∨B2)∨…) | C] = ∑i Pα[Bi | C]. The present result may be derived (without appealing to countable additivity) as follows. For each distinct i and j, let C ⊨ ~(Bi·Bj); and suppose that Pα[D | C] < 1 for at least one sentence D. First notice that we have, for each i greater than 1 and less than n, C ⊨ (~(B1·Bi+1)·…·~(Bi·Bi+1)); so C ⊨ ~(((B1∨B2)∨…∨Bi)·Bi+1). Then, for any finite list of the first n of the Bi (for each value of n),

Pα[(((B1∨B2)∨…∨Bn−1)∨Bn) | C]
     = Pα[((B1∨B2)∨…∨Bn−1) | C] + Pα[Bn | C]
     = …
     = ∑_{i=1}^{n} Pα[Bi | C].

By definition,


∑_{i=1}^{∞} Pα[Bi | C]  =  lim_{n→∞} ∑_{i=1}^{n} Pα[Bi | C].

So, lim_{n→∞} Pα[((B1∨B2)∨…∨Bn) | C]  =  ∑_{i=1}^{∞} Pα[Bi | C].

4. Here are the usual axioms when unconditional probability is taken as basic:

Pα is a function from statements to real numbers between 0 and 1 that satisfies the following rules:

  1. if  ⊨A (i.e. if A is a logical truth), then Pα[A] = 1;
  2. if  ⊨~(A·B) (i.e. if A and B are logically incompatible), then Pα[(A∨B)] = Pα[A] + Pα[B];

Definition: if Pα[B] > 0, then Pα[A | B] = Pα[(A·B)] / Pα[B].

5. Bayesians often refer to the probability of an evidence statement on a hypothesis, P[e | h·b·c], as the likelihood of the hypothesis. This can be a somewhat confusing convention since it is clearly the evidence that is made likely to whatever degree by the hypothesis. So, I will disregard the usual convention here. Also, presentations of probabilistic inductive logic often suppress c and b, and simply write ‘P[e | h]’. But c and b are important parts of the logic of the likelihoods. So I will continue to make them explicit.

6. These attempts have not been wholly satisfactory thus far, but research continues. For an illuminating discussion of the logic of direct inference and the difficulties involved in providing a formal account, see the series of papers (Levi, 1977), (Kyburg, 1978) and (Levi, 1978). Levi (1980) develops a very sophisticated approach.

Kyburg has developed a logic of statistical inference based solely on logical direct inference probabilities (Kyburg, 1974). Kyburg's logical probabilities do not satisfy the usual axioms of probability theory. The series of papers cited above compares Kyburg's approach to a kind of Bayesian inductive logic championed by Levi (e.g., in Levi, 1967).

7. This idea should not be confused with positivism. A version of positivism applied to likelihoods would hold that if two theories assign the same likelihood values to all possible evidence claims, then they are essentially the same theory, though they may be couched in different words. In short: same likelihoods implies same theory. The view suggested here, however, is not positivism, but its inverse, which should be much less controversial: different likelihoods implies different theories. That is, given that all of the relevant background and auxiliaries are made explicit (represented in ‘b’), if two scientists disagree significantly about the likelihoods of important evidence claims on a given hypothesis, they must understand the empirical content of that hypothesis quite differently. To that extent, though they may employ the same syntactic expressions, they use them to express empirically distinct hypotheses.

8. Call an object grue at a given time just in case either the time is earlier than the first second of the year 2030 and the object is green or the time is not earlier than the first second of 2030 and the object is blue. Now the statement ‘All emeralds are green (at all times)’ has the same syntactic structure as ‘All emeralds are grue (at all times)’. So, if syntactic structure determines priors, then these two hypotheses should have the same prior probabilities. Indeed, both should have prior probabilities approaching 0. For, there are an infinite number of competitors of these two hypotheses, each sharing the same syntactic structure: consider the hypotheses ‘All emeralds are grue_n (at all times)’, where an object is grue_n at a given time just in case either the time is earlier than the first second of the nth day after January 1, 2030, and the object is green or the time is not earlier than the first second of the nth day after January 1, 2030, and the object is blue. A purely syntactic specification of the priors should assign all of these hypotheses the same prior probability. But these are mutually exclusive hypotheses; so their prior probabilities must sum to a value no greater than 1. The only way this can happen is for ‘All emeralds are green’ and each of its grue_n competitors to have prior probability values either equal to 0 or infinitesimally close to it.

9. This assumption may be substantially relaxed without affecting the analysis below; we might instead only suppose that the ratios Pα[cn | hj·b]/Pα[cn | hi·b] are bounded so as not to get exceptionally far from 1. If that supposition were to fail, then the mere occurrence of the experimental conditions would count as very strong evidence for or against hypotheses — a highly implausible effect. Our analysis could include such bounded condition-ratios, but this would only add inessential complexity to our treatment.

10. For example, when a new disease is discovered, a new hypothesis hu+1 about that disease being a possible cause of patients’ symptoms is made explicit. The old catch-all was, “the symptoms are caused by some unknown disease, i.e., some disease other than h1,…, hu”. So the new catch-all hypothesis must now state that “the symptoms are caused by one of the remaining unknown diseases, i.e., some disease other than h1,…, hu, hu+1”. And, clearly, Pα[hK | b] = Pα[~h1·…·~hu | b] = Pα[~h1·…·~hu·(hu+1∨~hu+1) | b] = Pα[~h1·…·~hu·~hu+1 | b] + Pα[hu+1 | b] = Pα[hK* | b] + Pα[hu+1 | b]. Thus, the new hypothesis hu+1 is “peeled off” of the old catch-all hypothesis hK, leaving a new catch-all hypothesis hK* with a prior probability value equal to that of the old catch-all minus the prior of the new hypothesis.

11. This claim depends, of course, on hi being evidentially distinct from each alternative hj. I.e., there must be conditions ck with possible outcomes oku on which the likelihoods differ: P[oku | hi·b·ck]  ≠ P[oku | hj·b·ck]. Otherwise hi and hj are empirically equivalent, and no amount of evidence can support one over the other. (Did you think a confirmation theory could possibly do better, that it could somehow employ evidence to confirm the true hypothesis over evidentially equivalent rivals?) If the true hypothesis has evidentially equivalent rivals, then the convergence result just implies that the odds against the disjunction of the true hypothesis with these rivals very probably go to 0, so the posterior probability of this disjunction goes to 1. Among evidentially equivalent hypotheses the ratio of their posterior probabilities equals the ratio of their priors: Pα[hj | b·cn·en] / Pα[hi | b·cn·en]  =  Pα[hj | b] / Pα[hi | b]. So the true hypothesis will have a posterior probability near 1 (after evidence drives the posteriors of evidentially distinguishable rivals near to 0) just in case plausibility arguments and considerations (expressed in b) make each evidentially indistinguishable rival so much less plausible by comparison that the sum of their comparative plausibilities (as compared to the true hypothesis) remains very small.

One more comment about this. It is tempting to identify evidential distinguishability (via the evidential likelihoods) with empirical distinguishability. But many plausibility arguments in the sciences, such as thought experiments, draw on broadly empirical considerations, on what we know or strongly suspect about how the world works based on our experience of the world. Although this kind of “evidence” may not be representable via evidential likelihoods (because the hypotheses it bears on don't deductively or probabilistically imply it), it often plays an important role in scientific assessments of hypotheses — in assessments of whether a hypothesis is so extraordinary that only really extraordinary likelihood evidence could rescue it. It is (arguably) a distinct virtue of the Bayesian logic of evidential support that it permits such considerations to be figured into the net evaluation of support for hypotheses.

12. This is a good place to describe one reason for thinking that inductive support functions must be distinct from subjectivist or personalist degree-of-belief functions. Although likelihoods have a high degree of objectivity in many scientific contexts, it is difficult for belief functions to properly represent objective likelihoods. This is an aspect of the problem of old evidence.

Belief functions are supposed to provide an idealized model of belief strengths for agents. They extend the notion of ideally consistent belief to a probabilistic notion of ideally coherent belief strengths. There is no harm in this kind of idealization. It is supposed to supply a normative guide for real decision making. An agent is supposed to make decisions based on her belief-strengths about the state of the world, her belief strengths about possible consequences of actions, and her assessment of the desirability (or utility) of these consequences. But the very role that belief functions are supposed to play in decision making makes them ill-suited to inductive inferences where the likelihoods are often supposed to be objective, or at least possess inter-subjectively agreed values that represent the empirical import of hypotheses. For the purposes of decision making, degree-of-belief functions should represent the agent's belief strengths based on everything she presently knows. So, degree-of-belief likelihoods must represent how strongly the agent would believe the evidence if the hypothesis were added to everything else she presently knows. However, support-function likelihoods are supposed to represent what the hypothesis (together with explicit background and experimental conditions) says or implies about the evidence. As a result, degree-of-belief likelihoods are saddled with a version of the problem of old evidence – a problem not shared by support function likelihoods. Furthermore, it turns out that the old evidence problem for likelihoods is much worse than is usually recognized.

Here is the problem. If the agent is already certain of an evidence statement e, then her belief-function likelihoods for that statement must be 1 on every hypothesis. I.e., if Qγ is her belief function and Qγ[e] = 1, then it follows from the axioms of probability theory that Qγ[e | hi·b·c] = 1, regardless of what hi says — even if hi implies that e is quite unlikely (given b·c). But the problem goes even deeper. It not only applies to evidence that the agent knows with certainty. It turns out that almost anything the agent learns that can change how strongly she believes e will also influence the value of her belief-function likelihood for e, because Qγ[e | hi·b·c] represents the agent's belief strength given everything she knows.

To see the difficulty with less-than-certain evidence, consider the following example. Let e be any statement that is statistically implied to degree r by a hypothesis h together with experimental conditions c (e.g. e says “the coin lands heads on the next toss” and h·c says “the coin is fair and is tossed in the usual way on the next toss”). Then the correct objective likelihood value is just P[e | h·c] = r (e.g. for r = 1/2). Let d be a statement that is intuitively not relevant in any way to how likely e should be on h·c (e.g. let d say “Jim will be really pleased with the outcome of that next toss”). Suppose some rational agent has a degree-of-belief function Q for which the likelihood for e due to h·c agrees with the objective value: Q[e | h·c] = r (e.g. with r = 1/2).

Our analysis will show that this agent's belief-strength for d given ~e·h·c will be a relevant factor; so suppose that her degree-of-belief in that regard has any value s other than 1: Q[d | ~e·h·c] = s < 1 (e.g., suppose s = 1/2). This is a very weak supposition. It only says that adding ~e·h·c to everything else the agent currently knows leaves her less than certain that d is true.

Now, suppose this agent learns the following bit of new information in a completely convincing way (e.g. I seriously tell her so, and she believes me completely): (d∨e) (i.e., Jim will be really pleased with the outcome of the next toss unless it comes up heads).

Thus, on the usual Bayesian degree-of-belief account the agent is supposed to update her belief function Q to arrive at a new belief function Qnew by the updating rule:

Qnew[S] = Q[S | (d∨e)], for each statement S.

However, this update of the agent's belief function must undermine the objectivity of her new belief-function likelihood for e on h·c, because she now should have:

Qnew[e | h·c]
  = Qnew[e·h·c] / Qnew[h·c]
  = Q[e·h·c | (d∨e)] / Q[h·c | (d∨e)]
  = Q[(d∨e)·(e·h·c)] / Q[(d∨e)·(h·c)]
  = Q[(d∨e)·e | h·c] / Q[(d∨e) | h·c]
  = Q[e | h·c] / Q[((d·~e)∨e) | h·c]
  = Q[e | h·c] / [Q[e | h·c] + Q[d·~e | h·c]]
  = Q[e | h·c] / [Q[e | h·c] + Q[d | ~e·h·c] × Q[~e | h·c]]
  = r / [r + s×(1−r)]
  = 1 / [1 + s×(1−r)/r].

Thus, the updated belief function likelihood must have value Qnew[e | h·c] = 1 / [1 + s×(1−r)/r]. This value can be equal to the correct likelihood value r just in case s = 1. For example, for r = 1/2 and s = 1/2 we get Qnew[e | h·c] = 2/3.

The point is that even the most trivial knowledge of disjunctive claims involving e may completely upset the value of the likelihood for an agent's belief function. And an agent will almost always have some such trivial knowledge. Updating on such conditionals can force the agent's belief functions to deviate widely from the evidentially relevant objective values of likelihoods on which scientific hypotheses should be tested.
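The value derived above can be reproduced by brute-force conditionalization. The following sketch assumes a hypothetical joint distribution over e and d (given h·c) that is fixed by r, s, and an arbitrary extra parameter t = Q[d | e·h·c]; the name `updated_likelihood` and the parameter t are introduced here only for illustration.

```python
from fractions import Fraction

def updated_likelihood(r, s, t=Fraction(1, 2)):
    """Q_new[e | h·c] after conditionalizing on (d∨e).

    r = Q[e | h·c], s = Q[d | ~e·h·c], t = Q[d | e·h·c] (t is arbitrary; it drops out).
    """
    # Joint distribution over (e, d), everything taken given h·c.
    joint = {
        (True, True): r * t,
        (True, False): r * (1 - t),
        (False, True): (1 - r) * s,
        (False, False): (1 - r) * (1 - s),
    }
    # Conditionalizing on (d∨e) discards the single case where both e and d are false.
    kept = {world: p for world, p in joint.items() if world[0] or world[1]}
    total = sum(kept.values())
    return sum(p for (e, d), p in kept.items() if e) / total

r, s = Fraction(1, 2), Fraction(1, 2)
print(updated_likelihood(r, s))          # 2/3
print(1 / (1 + s * (1 - r) / r))         # 2/3, the closed form from the derivation
print(updated_likelihood(r, Fraction(1)))  # 1/2: the likelihood is undisturbed only when s = 1
```

The result does not depend on t, which matches the derivation: only the agent's belief strength for d given ~e·h·c matters.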

More generally, it can be shown that the incorporation into a belief function Q of almost any kind of evidence for or against the truth of a prospective evidence claim e (even uncertain evidence for e, as may come through Jeffrey updating) completely undermines the objective or inter-subjectively agreed likelihoods that a belief function might have expressed before updating. This should be no surprise. The agent's belief function likelihoods reflect her total degree-of-belief in e, based on a hypothesis h together with everything else she knows about e. So the agent's present belief function may capture appropriate public likelihoods for e only if e is completely isolated from the agent's other beliefs. And this will rarely be the case.

One Bayesian subjectivist response to this kind of problem is that the belief functions employed in scientific inductive inferences should often be “counterfactual” belief functions, which represent what the agent would believe if e were subtracted (in some suitable way) from everything else she knows (see, e.g., Howson & Urbach, 1993). However, our examples show that merely subtracting e won't do. One must also subtract any disjunctive statements containing e. And it can be shown that one must subtract any uncertain evidence for or against e as well. So the counterfactual belief function idea needs a lot of working out if it is to rescue the idea that subjectivist Bayesian belief functions can provide a viable account of the likelihoods employed by the sciences in inductive inferences.

13. To see the point more clearly, consider an example. To keep things simple, let's suppose our background b says that the chance of heads on tosses of this coin is some whole percentage between 0% and 100%. Let c say that the coin is tossed in the usual random way; let e say that the coin comes up heads; and for each r that is a whole number of hundredths between 0 and 1, let h[r] be the simple statistical hypothesis asserting that the chance of heads on each random toss of this coin is r. Now consider the composite statistical hypothesis h[>.65], which asserts that the chance of heads on each random (independent) toss is greater than .65. From the axioms of probability we derive the following relationship: Pα[e | h[>.65]·c·b]  =   P[e | h[.66]·c·b] × Pα[h[.66] | h[>.65]·c·b]  +  P[e | h[.67]·c·b] × Pα[h[.67] | h[>.65]·c·b] + …+ P[e | h[1]·c·b] × Pα[h[1] | h[>.65]·c·b]. The issue for the likelihoodist is that the values of the terms of form Pα[h[r] | h[>.65]·c·b] are not objectively specified by the composite hypothesis h[>.65] (together with c·b), but the value of the likelihood Pα[e | h[>.65]·c·b] depends essentially on these non-objective factors. So, likelihoods based on composite statistical hypotheses fail to possess the kind of objectivity that likelihoodists require.
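A short computation makes the dependence on the weights concrete. The sketch below evaluates the sum for two different assignments of the weights Pα[h[r] | h[>.65]·c·b]. Both assignments are hypothetical, chosen only to show that the composite likelihood moves with them.

```python
from fractions import Fraction

# The simple hypotheses h[r] covered by h[>.65]: r = .66, .67, ..., 1.00.
rs = [Fraction(k, 100) for k in range(66, 101)]

def composite_likelihood(weights):
    """P[e | h[>.65]·c·b] as the weighted sum of the simple likelihoods P[e | h[r]·c·b] = r."""
    assert sum(weights) == 1, "weights must form a probability distribution"
    return sum(r * w for r, w in zip(rs, weights))

# Hypothetical weighting 1: every h[r] equally plausible given h[>.65].
uniform = [Fraction(1, len(rs))] * len(rs)

# Hypothetical weighting 2: plausibility grows linearly with r.
raw = list(range(1, len(rs) + 1))
skewed = [Fraction(w, sum(raw)) for w in raw]

print(composite_likelihood(uniform))  # 83/100
print(composite_likelihood(skewed))   # a larger value, since high chances get more weight
```

Nothing in h[>.65] itself selects between these weightings, so nothing in it fixes the likelihood of e.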

14. The Law of Likelihood and the Likelihood Principle have been formulated in slightly different ways by various logicians and statisticians. The Law of Likelihood was first identified by that name in Hacking (1965), and has been invoked more recently by the likelihoodist statisticians A.F.W. Edwards (1972) and R. Royall (1997). R.A. Fisher (1922) argued for the Likelihood Principle early in the 20th century, although he didn't call it that. One of the first places it is discussed under that name is (Savage, et al., 1962).

15. What it means for a sample to be randomly selected from a population is philosophically controversial. Various analyses of the concept have been proposed, and disputed. For our purposes an account of the following sort will suffice. To say

S is a random sample of population B with respect to attribute A
means that
the selection set S is generated by a process that has an objective chance (or propensity) r of choosing individual objects that have attribute A from among the objects in population B, where on each selection the chance value r agrees with the value r of the frequency of As among the Bs, F[A,B].

Defined this way, randomness implies probabilistic independence among the outcomes of selections with regard to whether they exhibit attribute A, on any given hypothesis about the true value of the frequency r of As among the Bs.

The tricky part of generating a randomly selected set from the population is to find a selection process for which the chance of selecting an A each time matches the true frequency without already knowing what the true frequency value is, i.e., without already knowing what the value of r is. However, there clearly are ways to do this. Here is one way:

the sample S is generated by a process that on each selection gives each member of B an equal chance of being selected into S (like drawing balls from a well-shaken urn).

Here, schematically, is another way:

find a subclass of B, call it C, from which S can be generated by a process that gives every member of C an equal chance of being selected into S, where C is representative of B with respect to A in the sense that the frequency of A in C is almost precisely the same as the frequency of A in B.

Pollsters use a process of this kind. Ideally a poll of registered voters, population B, should select a sample S in a way that gives every registered voter the same chance of getting selected into S. But that may be impractical. However, it suffices if the sample is selected from a representative subpopulation C of B, e.g., from registered voters who answered the telephone between the hours of 7 PM and 9 PM in the middle of the week. Of course, the claim that a given subpopulation C is representative is itself a hypothesis that is open to inductive support by evidence. Professional polling organizations do a lot of research to calibrate their sampling technique, to find out what sort of subpopulations C they may draw on as highly representative. For example, one way to see if registered voters who answer the phone during the evening, mid-week, are likely to constitute a representative sample is to conduct a large poll of such voters immediately after an election, when the result is known, to see how representative of the actual vote count the count from the subpopulation turns out to be.

Notice that although the selection set S is selected from B, S cannot be a subset of B, not if S can be generated by sampling with replacement. For, a specific member of B may be randomly selected into S more than once. If S were a subset of B, any specific member of B could only occur once in S. That is, consider the case where S consists of n selections from B, but where the process happens to select the same member b of B twice. Then, were S a subset of B, although b is selected into S twice, S can only possess b as a member once, so S has at most n−1 members after all (even fewer if other members of B are selected more than once). So, rather than being members of B, the members of S must be representations of members of B, like names, where the same member of B may be represented by different names. However, the representations (or names) in S technically may not be the sorts of things that can possess attribute A. So, technically, on this way of handling the problem, when we say that a member of S exhibits A, this is shorthand for saying that the member of B to which that member of S refers possesses attribute A.
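The difference between a subset of B and a collection of representations of members of B can be seen in a few lines. In this sketch the labels `b1`, …, `b5` and the particular sequence of selections are hypothetical. Pairing each selection with its position is just one way of giving the same member of B different names.

```python
from collections import Counter

# Hypothetical labels for the members of population B, and n = 4 selections
# made with replacement, in which b3 happens to be selected twice.
population_b = ["b1", "b2", "b3", "b4", "b5"]
selections = ["b3", "b1", "b3", "b5"]

# Treated as a subset of B, the sample can contain b3 only once.
as_subset = set(selections)

# Treated as representations (selection number, referent), each selection
# is a distinct item even when two of them refer to the same member of B.
as_sample = list(enumerate(selections, start=1))

print(len(as_subset))              # 3, i.e. n − 1
print(len(as_sample))              # 4, i.e. n
print(Counter(selections)["b3"])   # 2 selections refer to b3
```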

16. This is closely analogous to the Stable-Estimation Theorem of (Edwards, Lindman, Savage, 1963). Here is a proof of Case 1, i.e. where the number of members of the reference class B is finite and where for some integer u at least as large as the size of B there is a specific (perhaps very large) integer K such that the prior probability of a hypothesis stating a frequency outside region R is never more than K times as large as a hypothesis stating a frequency within region R. (The proof of Case 2 is almost exactly the same, but draws on integrals wherever the present proof draws on sums using the ‘∑’ expression.)

A few observations before proceeding to the main derivation:

  1. The hypotheses under consideration consist of all expressions of form F[A,B] = k/u, where u is as described above and k is a non-negative integer between 0 and u.
  2. R is some set of fractions of form k/u for a contiguous sequence of non-negative integers k that includes the sample frequency m/n.
  3. In the following derivation all sums over values r in R are abbreviations for sums over integers k such that k/u is in R; similarly, all sums over values s not in R are abbreviations for sums over integers k such that k/u is not in R. The sum over {s | s=k/u} represents the sum over all integers k from 0 through u.
  4. Define L to be the smallest value of a prior probability Pα[F[A,B]=r | b] for r a fraction in R. Notice that L > 0 because, by supposition, finite K ≥ Pα[F[A,B]=s | b] / Pα[F[A,B]=r | b] for the largest value of Pα[F[A,B]=s | b] for which s is outside of R and the smallest value of Pα[F[A,B]=r | b] for which r is inside of region R.
  5. Thus, from the definition of L and of K, it follows that: K ≥ Pα[F[A,B]=s | b] / L for each value of Pα[F[A,B]=s | b] for which s is outside of R; and 1 ≤ Pα[F[A,B]=r | b] / L for each value of Pα[F[A,B]=r | b] for which r is inside of R.
  6. It follows that:

    ∑_{s∉R} s^m×(1−s)^(n−m) × (Pα[F[A,B]=s | b] / L)
      ≤  ∑_{s∉R} s^m×(1−s)^(n−m) × K

    and

    ∑_{r∈R} r^m×(1−r)^(n−m) × (Pα[F[A,B]=r | b] / L)
      ≥  ∑_{r∈R} r^m×(1−r)^(n−m).

  7. For β[R, m+1, n−m+1] defined as ∫_R r^m (1−r)^(n−m) dr / ∫_0^1 r^m (1−r)^(n−m) dr, when u is large, it is an established mathematical fact that

    [∑_{r∈R} r^m×(1−r)^(n−m)] / [∑_{s∈{s | s=k/u}} s^m×(1−s)^(n−m)]

    is extremely close to the value of β[R, m+1, n−m+1].

We now proceed to the main part of the derivation.

From the Odds Form of Bayes' Theorem (Equation 10) we have,

Ωα[F[A,B]∉R | F[A,S]=m/n · Rnd[S,B,A] · Size[S]=n · b]

  =  [∑_{s∉R} Pα[F[A,B]=s | F[A,S]=m/n · Rnd[S,B,A] · Size[S]=n · b]]
     / [∑_{r∈R} Pα[F[A,B]=r | F[A,S]=m/n · Rnd[S,B,A] · Size[S]=n · b]]

  =  [∑_{s∉R} P[F[A,S]=m/n | F[A,B]=s · Rnd[S,B,A] · Size[S]=n · b] × Pα[F[A,B]=s | b]]
     / [∑_{r∈R} P[F[A,S]=m/n | F[A,B]=r · Rnd[S,B,A] · Size[S]=n · b] × Pα[F[A,B]=r | b]]

  =  [∑_{s∉R} s^m×(1−s)^(n−m) × Pα[F[A,B]=s | b]]
     / [∑_{r∈R} r^m×(1−r)^(n−m) × Pα[F[A,B]=r | b]]

  =  [∑_{s∉R} s^m×(1−s)^(n−m) × (Pα[F[A,B]=s | b] / L)]
     / [∑_{r∈R} r^m×(1−r)^(n−m) × (Pα[F[A,B]=r | b] / L)]

  ≤  [∑_{s∉R} s^m×(1−s)^(n−m) × K]
     / [∑_{r∈R} r^m×(1−r)^(n−m)]

  =  K × [∑_{s∈{s | s=k/u}} s^m×(1−s)^(n−m)  −  ∑_{r∈R} r^m×(1−r)^(n−m)]
     / [∑_{r∈R} r^m×(1−r)^(n−m)]

  =  K × [ ([∑_{s∈{s | s=k/u}} s^m×(1−s)^(n−m)] / [∑_{r∈R} r^m×(1−r)^(n−m)])  −  1 ]

  ≈  K×[(1/β[R, m+1, n−m+1]) − 1].

Thus,

Ωα[F[A,B]∉R | F[A,S]=m/n · Rnd[S,B,A] · Size[S]=n · b]
    ≤   K×[(1/β[R, m+1, n−m+1]) − 1].

Then by equation (11), which expresses the relationship between posterior probability and posterior odds against,

Pα[F[A,B]∈R | F[A,S]=m/n · Rnd[S,B,A] · Size[S]=n · b]

  = 1 / (1 + Ωα[F[A,B]∉R | F[A,S]=m/n · Rnd[S,B,A] · Size[S]=n · b])
  ≥ 1 / (1 + K×[(1/β[R, m+1, n−m+1]) − 1]).
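The inequality can be checked numerically for a particular case. The sketch below makes several assumptions for illustration only: u = 1000, a sample with m = 60 and n = 100, the region R = [.5, .7], and a hypothetical prior that declines linearly in the frequency. It uses the discrete ratio of sums from observation 7 in place of β[R, m+1, n−m+1], so the bound it computes holds exactly rather than approximately.

```python
def kernel(s, m, n):
    """s^m × (1−s)^(n−m): the likelihood of the sample with the constant binomial coefficient dropped."""
    return s ** m * (1 - s) ** (n - m)

def posterior_and_bound(prior, in_region, m, n):
    """prior[k] is the prior probability of F[A,B] = k/u; in_region(k) says whether k/u is in R.

    Returns (posterior probability of R, lower bound 1/(1 + K×(1/ratio − 1)), K).
    """
    u = len(prior) - 1
    kern = [kernel(k / u, m, n) for k in range(u + 1)]
    inside = [k for k in range(u + 1) if in_region(k)]
    outside = [k for k in range(u + 1) if not in_region(k)]

    # Posterior probability of R by direct application of Bayes' theorem.
    post_in = sum(kern[k] * prior[k] for k in inside)
    post_out = sum(kern[k] * prior[k] for k in outside)
    posterior = post_in / (post_in + post_out)

    # L is the smallest prior inside R; K bounds the ratio of any outside prior to L.
    smallest_inside = min(prior[k] for k in inside)
    K = max(prior[k] for k in outside) / smallest_inside

    # Discrete stand-in for β[R, m+1, n−m+1].
    ratio = sum(kern[k] for k in inside) / sum(kern)
    bound = 1 / (1 + K * (1 / ratio - 1))
    return posterior, bound, K

u, m, n = 1000, 60, 100
raw = [1 + 3 * (1 - k / u) for k in range(u + 1)]   # hypothetical prior shape
total = sum(raw)
prior = [w / total for w in raw]

posterior, bound, K = posterior_and_bound(prior, lambda k: 500 <= k <= 700, m, n)
print(posterior, bound, K)
```

For a prior this mild K is only a little above 2, and the bound is correspondingly close to the actual posterior. A prior that strongly favors frequencies outside R would give a larger K and a weaker bound.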

17. To get a better idea of the import of this theorem, let's consider some specific values. First notice that the factor r×(1−r) can never be larger than (1/2)×(1/2) = 1/4; and the closer r is to 1 or 0, the smaller r×(1−r) becomes. So, whatever the value of r, the factor q/((r×(1−r)/n)^½) ≥ 2×q×n^½. Thus, for any chosen value of q,

P[r−q < F[A,S] < r+q | F[A,B] = r·Rnd[S,B,A]·Size[S] = n]
  ≥ 1 − 2×Φ[−2×q×n^½].

For example, if q = .05 and n = 400, then we have (for any value of r),

P[r−.05 < F[A,S] < r+.05 |  F[A,B] = r·Rnd[S,B,A]·Size[S] = 400]  ≥ .95.

For n = 900 (and margin q = .05) this lower bound rises to .997:

P[r−.05 < F[A,S] < r+.05 | F[A,B] = r·Rnd[S,B,A]·Size[S] = 900]  ≥ .997.

If we are interested in a smaller margin of error q, we can keep the same sample size and find the value of the lower bound for that value of q. For example,

P[r−.03 < F[A,S] < r+.03 | F[A,B] = r·Rnd[S,B,A]·Size[S] = 900]  ≥ .928.

By increasing the sample size the bound on the likelihood can be made as close to 1 as we want, for any margin q we choose. For example:

P[r−.01<F[A,S] <r+.01 | F[A,B] = r·Rnd[S,B,A]·Size[S] = 38000]  ≥ .9999.

As the sample size n becomes larger, it becomes extremely likely that the sample frequency will come to within any specified region close to the true frequency r, as close as you wish.
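The numerical values quoted above can be recomputed from the bound 1 − 2×Φ[−2×q×n^½]. This is a minimal sketch that assumes Φ is the standard normal cumulative distribution function, computed here from the error function in Python's standard library. The function names are introduced only for illustration.

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def lower_bound(q, n):
    """Lower bound on the likelihood that the sample frequency falls within q of r, for any r."""
    return 1 - 2 * phi(-2 * q * math.sqrt(n))

for q, n in [(0.05, 400), (0.05, 900), (0.03, 900), (0.01, 38000)]:
    print(q, n, lower_bound(q, n))
```

Because the bound uses the worst case r = 1/2, the actual likelihood for a frequency r nearer to 0 or 1 will be higher than the value printed.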

18. That is, for each inductive support function Pα, the posterior Pα[hj | b·cn·en] must go to 0 as the ratio Pα[hj | b·cn·en] / Pα[hi | b·cn·en] goes to 0; and that must occur if the likelihood ratios P[en | hj·b·cn] / P[en | hi·b·cn] approach 0, provided that the prior probability Pα[hi | b] is greater than 0. The Likelihood Ratio Convergence Theorem will show that when hi·b is true, it is very likely that the evidence will indeed be such as to drive the likelihood ratios as near to 0 as you please, for a long enough (or strong enough) evidence stream. (If the stream is strong in that the likelihood ratios of individual bits of evidence are small, then to bring about a very small cumulative likelihood ratio, the evidence stream need not be as long.) As likelihood ratios head towards 0, the only way a Bayesian agent can avoid having her inductive support function(s) yield posterior probabilities for hj that approach 0 (as n gets large) is to continually change her prior probability assessments. That means either continually finding and adding new plausibility arguments (i.e. adding to or modifying b) that on balance favor hj over hi, or continually reassessing the support strength due to plausibility arguments already available, or both.

Technically, continual reassessments of support strengths that favor hj over hi based on already extant arguments (in b) means switching to new support functions (or new vagueness sets of them) that assign hj ever higher prior probabilities as compared to hi based on the same arguments in b. In any case, such revisions of argument strengths may avoid the convergence towards 0 of the posterior probability of hj only if it proceeds at a rate that keeps ahead of the rate at which the evidence drives the likelihood ratios towards 0.

For a thorough presentation of the most prominent Bayesian convergence results and a discussion of their weaknesses see (Earman, 1992, Ch. 6). However, Earman does not discuss the convergence theorems under consideration here (due to the fact that the convergence results discussed here first appeared in (Hawthorne, 1993), just after Earman's book came out).

19. In scientific contexts all of the most important kinds of cases where large components of the evidence fail to be result-independent of one another are cases where some part of the total evidence helps to tie down the numerical value of a parameter that plays an important role in the likelihood values the hypothesis specifies for other large parts of the total evidence. In cases where this only happens rather locally, where the evidence for a parameter value influences the likelihoods of only a very small part of the total evidence that bears on the hypothesis, we can treat the conjunction of the evidence for the parameter value with the evidential outcomes whose likelihood the parameter value influences as a single chunk of evidence, which is then result-independent of the rest of the evidence (on each alternative hypothesis). This is the sort of chunking of the evidence into result-independent parts suggested in the main text.

However, in cases where the value of a parameter left unspecified by the hypothesis has a wide-ranging influence on many of the likelihood values the hypothesis specifies, another strategy for obtaining result-independence among these components of the evidence will do the job. A hypothesis that has an unspecified parameter value is in effect equivalent to a disjunction of more specific hypotheses, where each disjunct consists of a more precise version of the original hypothesis, a version in which the value for the parameter has been “filled in”. Relative to each of these more precise hypotheses, any evidence for or against the parameter value that hypothesis specifies is evidence for or against that more precise hypothesis itself. Furthermore, the evidence whose likelihood values depend on the parameter value (and because of that, failed to be result-independent of the parameter value evidence relative to the original hypothesis) is result-independent of the parameter value evidence relative to each of these more precise hypotheses, because each of the precise hypotheses already identifies precisely what (it claims) the value of the parameter is. Thus, wherever the workings of the logic of evidential support are made more perspicuous by treating evidence as composed of result-independent chunks, one may treat hypotheses whose unspecified parameter values interfere with result-independence as disjunctively composite hypotheses, and apply the evidential logic to these more specific disjuncts, and thereby regain result-independence.

20. Technically, suppose that Ok can be further “subdivided” into more outcome-descriptions by replacing okv with two “mutually exclusive parts”, okv* and okv#, to produce a new outcome space Ok$ = {ok1,…,okv*,okv#,…,okw}, where P[okv*·okv# | hi·b·ck] = 0 and P[okv* | hi·b·ck] + P[okv# | hi·b·ck] = P[okv | hi·b·ck]; and suppose similar relationships hold for hj. Then the new EQI* (based on Ok$) is greater than or equal to EQI (based on Ok); and EQI* > EQI just in case at least one of the new likelihood ratios, e.g., P[okv* | hi·b·ck] / P[okv* | hj·b·ck], differs in value from the “undivided” outcome's likelihood ratio, P[okv | hi·b·ck] / P[okv | hj·b·ck]. A supplement linked to this article proves this claim.
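
A small numerical sketch may help make the claim concrete. It assumes that EQI is the expected value, computed under hi, of the base-2 logarithm of the likelihood ratio of hi over hj (the reading that yields the value 1/3 in the coin example of note 21). The likelihood values below are hypothetical, chosen only for illustration.

```python
import math


def eqi(p_i, p_j):
    """Expected base-2 log likelihood ratio of h_i over h_j, expectation under h_i.

    p_i and p_j list the likelihoods of the same outcomes on h_i and on h_j.
    """
    return sum(pi * math.log2(pi / pj) for pi, pj in zip(p_i, p_j) if pi > 0)


# Hypothetical likelihoods for a three-outcome observation.
orig_i = [0.5, 0.3, 0.2]
orig_j = [0.2, 0.3, 0.5]

# Case A: split the first outcome into two parts that keep its likelihood
# ratio (0.25/0.10 = 0.5/0.2), so the subdivision adds no information.
same_i = [0.25, 0.25, 0.3, 0.2]
same_j = [0.10, 0.10, 0.3, 0.5]

# Case B: split it into parts whose likelihood ratios differ
# (0.4/0.05 versus 0.1/0.15).
diff_i = [0.4, 0.1, 0.3, 0.2]
diff_j = [0.05, 0.15, 0.3, 0.5]

eqi_orig = eqi(orig_i, orig_j)
eqi_same = eqi(same_i, same_j)
eqi_diff = eqi(diff_i, diff_j)
print(eqi_orig, eqi_same, eqi_diff)
```

Under these assumed numbers the subdivision in Case A leaves EQI unchanged, while the subdivision in Case B strictly increases it.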

21. The likely rate of convergence will almost always be much faster than the worst case bound provided by Theorem 2. To see the point more clearly, let's look at a very simple example. Suppose hi says that a certain bent coin has a propensity for “heads” of 2/3 and hj says the propensity is 1/3. Let the evidence stream consist of outcomes of tosses. In this case the average EQI equals the EQI of each toss, which is 1/3 (taking logarithms to base 2); and the smallest possible likelihood ratio occurs for “heads”, which yields the value γ = ½. So, the value of the lower bound given by Theorem 2 for the likelihood of getting an outcome sequence with a likelihood ratio below ε (for hj over hi) is

1 − (1/n)(log ½)² / ((1/3) + (log ε)/n)²  =  1 − 9/(n × (1 + 3(log ε)/n)²).

Thus, according to the theorem, the likelihood of getting an outcome sequence with a likelihood ratio less than ε = 1/16 (≈ .06) when hi is true and the number of tosses is n = 52 is at least .70; and for n = 204 tosses the likelihood is at least .95.
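
The two figures can be checked with a short sketch that evaluates the bound exactly as displayed above. The function name and its default arguments are illustrative assumptions; the defaults encode the coin example (average EQI of 1/3, γ = ½, base-2 logarithms).

```python
import math


def theorem2_lower_bound(n, eps, avg_eqi=1 / 3, gamma=0.5):
    """Lower bound, from the displayed formula, on the likelihood that n
    observations yield a likelihood ratio below eps when h_i is true."""
    margin = avg_eqi + math.log2(eps) / n
    if margin <= 0:
        # The formula is only used when the average EQI exceeds -(log eps)/n.
        raise ValueError("average EQI must exceed -(log eps)/n")
    return 1.0 - (1.0 / n) * math.log2(gamma) ** 2 / margin ** 2


bound_52 = theorem2_lower_bound(52, 1 / 16)
bound_204 = theorem2_lower_bound(204, 1 / 16)
print(round(bound_52, 4), round(bound_204, 4))
```

For small n the expression can be negative, in which case the bound is vacuous; it becomes informative only as n grows.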

To see the amount by which the lower bound provided by the theorem is in fact overly cautious, consider what the usual binomial distribution for the coin tosses in this example implies about the likely values of the likelihood ratios. The likelihood ratio for exactly k “heads” in n tosses is ((1/3)^k (2/3)^(n−k)) / ((2/3)^k (1/3)^(n−k)) = 2^(n−2k); and we want this likelihood ratio to have a value less than ε. A bit of algebraic manipulation shows that to get this likelihood ratio value to be below ε, the proportion of “heads” needs to be k/n > ½ − ½(log ε)/n. Using the normal approximation to the binomial distribution (with mean = 2/3 and variance = (2/3)(1/3)/n) the actual likelihood of obtaining an outcome sequence having more than ½ − ½(log ε)/n “heads” (which we just saw corresponds to getting a likelihood ratio less than ε, thus disfavoring the 1/3 propensity hypothesis as compared to the 2/3 propensity hypothesis by that much) when the true propensity for “heads” is 2/3 is given by the formula

Φ[(mean − (½ − ½(log ε)/n)) / (variance)^½]  =  Φ[(1/8)^½ n^½ (1 + 3(log ε)/n)]

(where Φ[x] gives the value of the cumulative standard normal distribution from −∞ to x). Now let ε = 1/16 (= .0625), as before. So the actual likelihood of obtaining a stream of outcomes with likelihood ratio this small when hi is true and the number of tosses is n = 52 is Φ[1.96] > .975, whereas the lower bound given by Theorem 2 was .70. And if the number of tosses is increased to n = 204, the likelihood of obtaining an outcome sequence with a likelihood ratio this small (i.e., ε = 1/16) is Φ[4.75] > .999999, whereas the lower bound from Theorem 2 for this likelihood is .95. Indeed, to actually get a likelihood of .95 that the evidence stream will produce a likelihood ratio less than ε = 1/16, the number of tosses needed is only n = 43, rather than the 204 tosses the bound given by the theorem requires in order to get up to the value .95. (Note: These examples employ “identically distributed” trials, namely repeated tosses of a coin, as an illustration. But Convergence Theorem 2 applies much more generally. It applies to any evidence sequence, no matter how diverse the probability distributions for the various experiments or observations in the sequence.)
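
The normal-approximation figures can be reproduced with the following sketch, which evaluates the Φ formula given above using only the standard library. The function names are illustrative assumptions, and the search for the smallest n relies on the approximation, not on the exact binomial distribution.

```python
import math


def phi(x):
    """Cumulative standard normal distribution from minus infinity to x."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def approx_prob_ratio_below(n, eps):
    """Normal-approximation likelihood, when the 2/3 propensity hypothesis is
    true, that n tosses yield a likelihood ratio below eps (base-2 logs)."""
    return phi(math.sqrt(n / 8.0) * (1.0 + 3.0 * math.log2(eps) / n))


def smallest_n_reaching(target, eps, n_max=10_000):
    """Smallest n (up to n_max) whose approximate likelihood reaches target."""
    for n in range(1, n_max + 1):
        if approx_prob_ratio_below(n, eps) >= target:
            return n
    return None


eps = 1 / 16
p_52 = approx_prob_ratio_below(52, eps)
p_204 = approx_prob_ratio_below(204, eps)
n_95 = smallest_n_reaching(0.95, eps)
print(p_52, p_204, n_95)
```

Comparing these values with the Theorem 2 bounds for the same n shows how conservative the worst case bound is in this example.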

22. It should now be clear why the boundedness of EQI above 0 is important. Convergence Theorem 2 applies only when EQI[cn | hi/hj | b] > −(log ε)/n. But this requirement is not a strong assumption. For, the Nonnegativity of EQI Theorem shows that the empirical distinctness of two hypotheses on a single possible outcome suffices to make the average EQI positive for the whole sequence of experiments. So, given any small fraction ε > 0, the value of −(log ε)/n (which is always greater than 0 when ε < 1) will eventually become smaller than EQI, provided that the degree to which the hypotheses are empirically distinct for the various observations ck does not on average degrade too much as the length n of the evidence stream increases.
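
As a rough illustration of the applicability condition, the sketch below finds the first n at which −(log ε)/n drops below the average EQI. It assumes, for simplicity, that the average EQI stays constant as n grows (as in the coin example of note 21) and that logarithms are base 2; the function names are hypothetical.

```python
import math


def theorem2_applies(avg_eqi, eps, n):
    """True when the average EQI exceeds -(log eps)/n, the condition above."""
    return avg_eqi > -math.log2(eps) / n


def first_applicable_n(avg_eqi, eps, n_max=1_000_000):
    """Smallest n (up to n_max) at which the condition holds, assuming the
    average EQI does not change with n."""
    for n in range(1, n_max + 1):
        if theorem2_applies(avg_eqi, eps, n):
            return n
    return None


# Coin example: average EQI of 1/3 and eps = 1/16.
coin_n = first_applicable_n(1 / 3, 1 / 16)
print(coin_n)
```

A smaller ε or a smaller average EQI pushes this threshold out, but for any fixed positive average EQI it is always reached at some finite n.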

When the possible outcomes for the sequence of observations are independent and identically distributed, Theorems 1 and 2 effectively reduce to L. J. Savage's Bayesian Convergence Theorem [Savage, pg. 52-54], although Savage's theorem doesn't supply explicit lower bounds on the probability that the likelihood ratio will be small. Independent, identically distributed outcomes most commonly result from the repetition of identical statistical experiments (e.g., repeated tosses of a coin, or repeated measurements of quantum systems prepared in identical states). In such experiments a hypothesis will specify the same likelihoods for the same kinds of outcomes from one observation to the next. So EQI will remain constant as the number of experiments, n, increases. However, Theorems 1 and 2 are much more general. They continue to hold when the sequence of observations encompasses completely unrelated experiments that have different distributions on outcomes — experiments that have nothing in common but their connection to the hypotheses they test.

23. In many scientific contexts this is the best we can hope for. But it still provides a very reasonable representation of inductive support. Consider, for example, the hypothesis that the land masses of Africa and South America separated and drifted apart over the eons, the drift hypothesis, as opposed to the hypothesis that the continents have fixed positions acquired when the earth first formed and cooled and contracted, the contraction hypothesis. One may not be able to determine anything like precise likelihoods, on each hypothesis, for the evidence that: (1) the shape of the east coast of South America matches the shape of the west coast of Africa as closely as it in fact does; (2) the geology of the two coasts matches up so closely when they are “fitted together” in the obvious way; (3) the plant and animal species on these distant continents are as similar as they are, as compared to how similar species are among other distant continents. Although neither the drift hypothesis nor the contraction hypothesis supplies anything like precise likelihoods for these evidential claims, experts readily agree that each of these observations is much more likely on the drift hypothesis than on the contraction hypothesis. That is, the likelihood ratio for this evidence on the contraction hypothesis as compared to the drift hypothesis is very small. Thus, jointly these observations constitute very strong evidence for drift over contraction.

Historically, the case of continental drift is more complicated. Geologists tended to largely dismiss this evidence until the 1960s. This was not because the evidence wasn't strong in its own right. Rather, this evidence was found unconvincing because it was not sufficient to overcome prior plausibility considerations that made the drift hypothesis extremely implausible, much less plausible than the contraction hypothesis. The problem was that there seemed to be no plausible mechanism by which drift might occur. It was argued, quite plausibly, that no known force could push or pull the continents apart, and that the less dense continental material could not push through the denser material that makes up the ocean floor. These plausibility objections were overcome when a plausible mechanism was articulated: the continental crust floats atop molten material and moves apart as convection currents in the molten material carry it along. The case was pretty well clinched when evidence for this mechanism was found in the form of “spreading zones” containing alternating strips of magnetized material at regular distances from mid-ocean ridges. The magnetic alignments of materials in these strips correspond closely to the magnetic alignments found in magnetic materials in dateable sedimentary layers at other locations on the earth. These magnetic alignments indicate time periods when the direction of earth's magnetic field has reversed. And this gave geologists a way of measuring the rate at which the sea floor might spread and the continents move apart. Although geologists may not be able to determine anything like precise values for the likelihoods of any of this evidence on each of the alternative hypotheses, the evidence is universally agreed to be much more likely on the drift hypothesis than on the alternative contraction hypothesis. The likelihood ratio for this evidence on the contraction hypothesis as compared to the drift hypothesis is somewhat vague, but extremely small. The vagueness is only in regard to how extremely small the likelihood ratio is. Furthermore, with the emergence of a plausible mechanism, the drift hypothesis is no longer so overwhelmingly implausible prior to taking the likelihood evidence into account. Thus, even when precise values for individual likelihoods are not available, the value of a likelihood ratio range may be objective enough to strongly refute one hypothesis as compared to another. Indeed, the drift hypothesis is itself strongly supported by the evidence; for, no alternative hypothesis that has the slightest amount of comparative plausibility can account for the available evidence nearly so well. (That is, no plausible alternative makes the evidence anywhere near so likely.) Given the currently available evidence, the only issues left open (for now) involve comparing various alternative versions of the drift hypothesis (involving differences of detail) against one another.

24. To see the point of the third clause, suppose it were violated. That is, suppose there are possible outcomes for which the likelihood ratio is very near 1 for just one of the two support functions. Then, even a very long sequence of such outcomes might leave the likelihood ratio for one support function almost equal to 1, while the likelihood ratio for the other support function goes to an extreme value. If that can happen for support functions in a class that represent likelihoods for various scientists in the community, then the empirical content of the hypotheses is either too vague or too much in dispute for meaningful empirical evaluation to occur.

25. If there are a few directionally controversial likelihood ratios, where Pα says the ratio is somewhat greater than 1 while Pβ assigns a value somewhat less than 1, these may not greatly affect the trend of Pα and Pβ towards agreement on the refutation and support of hypotheses, provided that the controversial ratios are not so extreme as to overwhelm the stream of other evidence on which the likelihood ratios do directionally agree. Even so, researchers will want to get straight on what the hypothesis says or implies about such cases. While that remains in dispute, the empirical content of the hypothesis remains unsettlingly vague.

Copyright © 2012 by
James Hawthorne <hawthorne@ou.edu>
