Reading Dr. Carrier's "Proving History": A Review From a Bayesian Perspective
Tim Hendrix*
May 4, 2013
Introduction
Dr. Richard Carrier's new book, "Proving History" (Prometheus Books, ISBN 978-1-61614-560-6), is the first of two volumes in which Dr. Carrier investigates the question of whether Jesus existed or not. According to Dr. Carrier, the current state of Jesus studies is one of confusion and failure, in which all past attempts to recover the "true" Jesus have failed. According to Dr. Carrier, the main problem is the methods employed: past studies have focused on developing historical criteria to determine which parts of a text (for instance the Gospel of Luke) can be trusted, but according to Dr. Carrier all criteria and their use are flawed. This has led to many incompatible views of what Jesus said or did, and accordingly the question "Who was Jesus?" has many incompatible answers: a Cynic sage, a Rabbinical Holy Man, a Zealot Activist, an Apocalyptic prophet and so on. Richard Carrier proposes that Bayes theorem (see below) should be employed in all areas of historical study. Specifically, Dr. Carrier proposes that the problems plaguing the methods of criteria can be solved by applying Bayes theorem, and that this will finally allow the field of Jesus studies to advance. What this progress will be like, and specifically how the question of whether Jesus existed should be answered, will be the subject of his second volume.

I was interested in Dr. Carrier's book both because I have a hobby interest in Jesus studies and found his other book on early Christianity, "Not the Impossible Faith", very enjoyable and informative, but certainly also because Bayesian methods were the focus area of my PhD and are my current research area. My main focus in writing this review will therefore be on the technical content relating to the use of Bayes theorem and its applicability to historical questions as argued in the book. The book is divided into six chapters.
Chapter one contains an introduction which argues that historical Jesus studies in its present form is rife with problems; chapter two introduces the historical method as a set of 12 axioms and 12 rules;
* Tim Hendrix is not my real name. For family reasons I prefer not to have my name associated with my religious views online. All quotations are from "Proving History".
chapter three introduces Bayes theorem; chapter four discusses historical methods and seeks to demonstrate with formal logic that all valid historical methods reduce to applications of Bayes theorem; chapter five goes through historical criteria often used in Jesus studies and concludes that each is only valid insofar as it agrees with Bayes theorem. Finally, chapter six, titled "the hard stuff", discusses
a number of issues that arise in applying Bayes theorem as well as Richard
Carrier's proposal for how a frequentist and Bayesian view of probabilities can be unified. In reviewing this book I wish to focus on what I believe are the book's main contributions. The first point is that Bayes theorem not only applies to the historical method, but that it can be formally proven that all historical methods can
be reduced to applications of Bayes theorem, and, importantly, that thinking in this way will give tangible benefits compared to traditional historical methods. The second point is how Dr. Carrier addresses several philosophical points that are raised throughout the book, for instance the unification of the frequentist and Bayesian views of probabilities. Since I am not a philosopher I will not be able to say much on the philosophical side, but I do think there are a number
of points which fall squarely within my field that should be raised.
However, before I proceed, I will first briefly touch upon the Bayesian view of probabilities and Bayes theorem.
1 Bayes Theorem
I wish to begin with a point that may seem pedantic at first, namely why we should think Bayes theorem is true at all. Dr. Carrier introduces Bayes theorem as follows:
In simple terms, Bayes's Theorem is a logical formula that deals with cases of empirical ambiguity, calculating how confident we can be in any particular conclusion, given what we know at the time. The theorem was discovered in the late eighteenth century and has since been formally proved, mathematically and logically, so we now know its conclusions are always necessarily true if its premises are true. (Chapter 3)
Unfortunately there are no references for this section, and so it is not explained what definitions Bayes theorem makes use of, which assumptions it rests upon, and how it is proven. For reasons I will return to later, I think this omission is problematic. However, shortly after the above quotation, just before introducing the formula for Bayes theorem, we are given a reference:
But if you do want to advance to more technical issues of the application and importance of Bayes’s Theorem, there are several highly commendable texts[9]
Footnote 9 has as its first entry E.T. Jaynes' "Probability Theory" from 2003. I highly endorse this choice, and I think most Bayesian statisticians would agree.
E.T. Jaynes was not only an influential physicist, he was a great communicator, and his book is my preferred reference for students. In his book, Jaynes argues that Bayes theorem is an extension of logic, and I will attempt to give the gist of Jaynes' treatment of Bayes theorem below. Interested readers can find an almost complete draft of Jaynes' book freely available online¹:
Suppose you want to program a robot that can reason in a sensible manner. You want the robot to reason quantitatively about true/false statements such as:
A = "The next flip of the coin will be heads"

B = "There has been life on Mars"

C = "Jesus existed"
A basic problem is that neither we nor the robot have perfect knowledge, and so it
must reason under uncertainty. Accordingly, we want the robot to have a notion
of the ”degree of plausibility” of some statements given other statements that
are accepted as true. The most important point in the above is that I have not defined what the "degree of plausibility" is. Put in other words, the goal is to analyse what "the degree of plausibility" of some statement could possibly mean and derive the consequences. Jaynes' treatment in Probability Theory is both thorough and entertaining², and at the end he arrives at the following three desiderata that a notion of degree of plausibility must fulfil:
• The degree of plausibility must be described by a real number
• It must agree with common sense (logic) in the limit of certainty
• It must be consistent
Consistency implies that if we have two ways to reason about the degree of plausibility of a statement, these two ways must give the same result. After some further analysis he arrives at the result that the degrees of plausibility of statements A, B, C, ... can be described by a function P, and this function must behave like ordinary probabilities usually do, including Bayes theorem:
P(A|B) = P(B|A)P(A) / P(B)
where the notation P(A|B) means "the degree of plausibility of A given B". The key point is that Bayes theorem now (if we accept what goes into the derivation) applies not only to flips of coins, but to all assignments of degrees of plausibility of true/false statements we may consider, and the interpretation that a probability is really a degree of plausibility is then called the Bayesian
¹ c.f. http://bayes.wustl.edu/etj/prob/book.pdf
² It should be noted that the argument is not original to E.T. Jaynes; see R.T. Cox's work from 1946 and 1961, or Jaynes' book, for a detailed discussion of the history.
interpretation of probabilities. It is in this sense that Jaynes (as well as most others who call themselves Bayesians) considers Bayes theorem an extension of logic. These definitions may appear somewhat technical and irrelevant at this point; however, their importance will hopefully become apparent later. For now let us make a few key observations:
• Bayes theorem does not tell us what any particular probability should be

• Bayes theorem does not tell us how we should define the statements A, B, C in a particular situation

What Bayes theorem does provide is a consistency requirement: if we know the probabilities on the right-hand side of the above equation, then we know what the probability on the left-hand side should be.
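To make this consistency requirement concrete, here is a minimal sketch in Python (the function name and the example numbers are my own illustration, not from the book): once the probabilities on the right-hand side are given, the left-hand side is fixed.

```python
def bayes_posterior(p_b_given_a, p_a, p_b_given_not_a):
    """Return P(A|B) from the quantities on the right-hand side of
    Bayes theorem, expanding P(B) by the law of total probability:
    P(B) = P(B|A)P(A) + P(B|~A)P(~A)."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)
    return p_b_given_a * p_a / p_b

# Illustration: a 50/50 prior and evidence twice as likely if A is true.
print(bayes_posterior(p_b_given_a=0.8, p_a=0.5, p_b_given_not_a=0.4))  # ≈ 0.667
```

The point of the sketch is only that the right-hand side determines the left-hand side; nothing in it tells us where those input numbers come from.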
2 Is all historical reasoning just Bayes theorem?
First and foremost, I think it is entirely uncontroversial to say that Bayes theorem has something important to say about reasoning in general, and so also about historical reasoning. For instance, as various toy examples show, Bayes theorem provides a powerful tool for weeding out biases and logical fallacies we are all prone to. However, I believe Dr. Carrier has a more general connection between BT and the historical method in mind. In chapter 3:
Since BT is formally valid and its premises (the probabilities we enter into it) constitute all that we can relevantly say about the likelihood of any historical claim being true, it should follow that all valid historical reasoning is described by Bayes’s Theorem (whether historians are aware of this or not). That would mean any historical reasoning that cannot be validly described by Bayes’s Theorem is itself invalid (all of which I’ll demonstrate in the next chapter). There is no other theorem that can make this claim. But I shall take up the challenge of proving that in the next chapter.
and later, just before the formal proof:
we could simply conclude here and now that Bayes's Theorem models and describes all valid historical methods. No other method is needed, apart from the endless plethora of techniques that will be required to apply BT to specific cases of which the AFE and ABE represent highly generalized examples, but examples at even lower levels of generalization could be explored as well (such as the methods of textual criticism, demographics, or stylometrics). All become logically valid only insofar as they conform to BT and thus are better informed when carried out with full awareness of their Bayesian underpinning. This should already be sufficiently clear by
now, but there are always naysayers. For them, I shall establish this conclusion by formal logic
The crux of the logical argument seems to be this. Dr. Carrier defines variables C, D and E, of which only D and E will be of interest to us. The relevant part of the argument is as follows:
Formally, if C = ”a valid historical method that contradicts BT,” D = ”a valid historical method fully modeled and described by (and thereby reducible to) BT,” and E = ”a valid historical method that is consistent with but only partly modeled and described by BT,” then:
P8 Either C, D, or E. (proper trichotomy)
···
P10 If P5 and P6, then ∼E.
P11 P5 and P6.
C4 Therefore, ∼E
···
To establish premises P5 and P6, we consider a historical claim h, a piece of evidence e and some background knowledge b. The premises are as follows:³
P5 Anything that can be said about any historical claim h that makes any valid difference to the probability that h is true will either (a) make h more or less likely on considerations of background knowledge alone or (b) make the evidence more or less likely on considerations of the deductive predictions of h given that same background knowledge or (c) make the evidence more or less likely on considerations of the deductive predictions of some other claim (a claim which entails h is false) given that same background knowledge.
P6 Making h more or less likely on considerations of background knowledge alone is the premise P(h|b) in BT; making the evidence more or less likely on considerations of the deductive predictions of h on that same background knowledge is the premise P(e|h.b) in BT; making the evidence more or less likely on considerations of the deductive predictions of some other claim that entails h is false is the premise P(e|∼h.b) in BT; any value for P(h|b) entails the value for the premise P(∼h|b) in BT; and these exhaust all the premises in BT.
³ I have chosen to follow Dr. Carrier's typesetting, and accordingly for propositions such as A = "It will rain tomorrow" and B = "It will be cold tomorrow" the notation ∼A means "not A" ("it will not rain tomorrow") and A.B means "A and B" ("it will be rainy and cold tomorrow").
I think we can summarize the argument as follows: consider a valid historical method. Either the historical method is fully or only partly described by Bayes theorem. We can rule out the latter possibility, E, for the following reason:
Anything that can be said about the probability a historical claim h is true given some background knowledge b and evidence e, denoted by P(h|e.b), will affect either P(h|b), P(e|h.b), P(e|∼h.b) or P(∼h|b). However, these values fully determine P(h|e.b) according to Bayes theorem:
P(h|e.b) = P(e|h.b)P(h|b) / [P(e|h.b)P(h|b) + P(e|∼h.b)P(∼h|b)]
and so the method must be fully included in Bayes theorem, proving the original
claim. I see two problems with the argument. The first is somewhat technical but needs to be raised: though it is not stated explicitly, Dr. Carrier tacitly assumes that the only thing we are interested in about a claim h is the probability that h is true. However, I see no reason why this should be the case. For instance, Dempster-Shafer theory establishes the support and plausibility (the latter term is used to
a diﬀerent eﬀect than I did in the introduction) of a claim and multi-valued
logics attempt to define and analyze the graded truth of a proposition; all of these are concepts different from probability. It is not at all apparent why these concepts can be ruled out as being either not useful or reducible to Bayes theorem. For instance, suppose we define Jesus as a "highly inspirational prophet"; a great many in my field would say the modifier "highly" is not well analysed in terms of probabilities but requires other tools. More generally, it goes without saying that we do not have a general theory of cognition, and I would be very surprised if that theory turned out to reduce to probability theory in the case of history. The second problem is more concrete and relates to the scope of what is being demonstrated: let's assume we are only interested in the probability of a claim
h being true. As noted in the previous section, Bayes theorem clearly only says something about how the quantity on the left-hand side of the above equation, P(h|e.b), must be related to those on the right-hand side, and Dr. Carrier
is correct in pointing out that any change in P(h|e.b) must (this is pure algebra!)
correspond to a change in at least one term on the right-hand side. The problem is that we do not know what those quantities on the right-hand side are numerically,
and we cannot figure them out merely by applying Bayes theorem more times. For instance, applying Bayes theorem to the term P(e|h.b) will require knowledge of P(h|e.b), exactly the term we set out to determine. This, however, seems to severely undercut the importance of what is being
demonstrated. Let me illustrate this with an example: let's say I make a claim such as:
Basic algebra [Bayes’s Theorem] models and describes all valid methods for reasoning about unemployment [historical methods]
My proof goes as follows: let X be the number of unemployed people, Y the number of people who are physically unable to work due to some disability, and
Z the number of people who can work but have not found work. Now the algebra:
X = Y + Z
(contrast this equation to Bayes theorem). I can now make a claim equivalent to P5 and P6: all that can be validly said about X must imply a change in either Y or Z, and I can conclude that all that can validly be said about the number of unemployed people must therefore be described by algebra. Clearly this is true in some sense; however, it misses nearly everything of economic interest, such as what actually affects the terms Y and Z and by how much. While it is clear that if X changes at least one of the terms Y or Z has to change, algebra does not tell us which, just as Bayes theorem does not tell us what the quantities P(e|h.b), P(h|b), ... actually are, and it does not tell us how the propositions e, h, b should be defined.

Suppose we try to rescue the idea of a formal proof by accepting that the term "a valid historical method" simply means a system (or method) of inference which operates on the probabilities of propositions, without worrying about which propositions are relevant (which Bayes theorem does not say) or how to obtain these probabilities (which Bayes theorem does not say either). But if we accept this definition, I see no reason why we could not simply replace the argument in chapter 4 with the following:
Bayesian inference describes the relationship between probabil- ities of various propositions (c.f. Jaynes, 2003). In particular it applies when the propositions are related to historical events.
This claim would of course be hard to expand into about half a chapter. It is of course true that Bayesian methods have found wide application in almost all sciences, but this has been because Bayesian methods have shown themselves to work. I completely agree with Dr. Carrier that there are reasons to consider how Bayesian methods could be applied to history so as to give tangible results, but the main point is that this must be settled by giving examples of actual applications that offer tangible benefits, just as has been the case in all other scientific disciplines where Bayesian methods are presently applied. This is what I will focus on in the next sections.
3 Applications of Bayes theorem in "Proving History"
To my surprise, Proving History contains almost no applications of Bayes theorem to historical problems. The purpose of most of the applications of Bayes theorem in Proving History is to illustrate aspects of Bayes theorem and show how it agrees with our common intuition. Take for instance the first example in the book, the analysis of the disappearing sun in chapter 3, which seems mainly intended to show how different degrees of evidence affect one's conclusions in Bayes theorem. The example considers an ahistorical disappearing
sun in 1989 with overwhelming observational evidence, and the claimed disappearing sun in the gospels with very little evidence, and shows that according to Bayes theorem we should be more inclined to believe the disappearance with overwhelming evidence. This is certainly true; however, it is not telling us anything new. The example which by far receives the most extensive treatment is the criteria of embarrassment, for which the discussion takes up about half of chapter five and ends with a computation of probabilities. I will therefore only focus on this example:
3.1 The criteria of embarrassment
The criteria of embarrassment (EC) is as follows:
The EC (or Embarrassment Criterion) is based on the folk belief that if an author says something that would embarrass him, it must be true, because he wouldn't embarrass himself with a lie. An EC argument (or Argument from Embarrassment) is an attempt to apply this principle to derive the conclusion that the embarrassing statement is historically true. For example, the criterion of embarrassment states that material that would have been embarrassing to early Christians is more likely to be historical since it is unlikely that they would have made up material that would have placed them or Jesus in a bad light. (Chapter 5)
Dr. Carrier then offers an extended discussion of some of the problems with the criteria of embarrassment, which I found well written and interesting. The problems raised are: (1) the gospels are themselves very late, making it problematic to assume the authors had access to an embarrassing core tradition they felt compelled to write down, (2) we do not know what would be embarrassing for the early church, and (3) would the gospel authors pen something genuinely embarrassing at all? There then follow treatments of several "embarrassing" stories in the gospels where Dr. Carrier argues (convincingly in my opinion) there can be little ground for an application of the EC. We then get to the application of Bayes theorem:
Thus, for the Gospels, we're faced with the following logic. If N(T) = the number of true embarrassing stories there actually were in any friendly source, N(∼T) = the number of false embarrassing stories that were fabricated by friendly sources, N(T.M) = the number of true embarrassing stories coinciding with a motive for friendly sources to preserve them that was sufficient to cause them to be preserved, N(∼T.M) = the number of false embarrassing stories (fabricated by friendly sources) coinciding with a motive for friendly sources to preserve them that was sufficient to cause them to be preserved, and N(P) = the number of embarrassing stories that were preserved (both true and fabricated), then
N(P) = N(T.M) + N(∼T.M), and P(T|P), the frequency of true stories among all embarrassing stories preserved, = N(T.M)/N(P), which entails P(T|P) = N(T.M)/(N(T.M) + N(∼T.M)). Since all we have are friendly sources that have no independently confirmed reliability, and no confirmed evidence of there ever being any reliable neutral or hostile sources, it further follows that N(T.M) = qN(T), where q < 1, and N(∼T.M) = 1 × N(∼T): because all false stories created by friendly sources have motives sufficient to preserve them (since that same motive is what created them in the first place), whereas this is not the case for true stories that are embarrassing, for few such stories so conveniently come with sufficient motives to preserve them (as the entire logic of the EC argument requires). So the frequency of the former must be 1, and the frequency of the latter (i.e., q) must be < 1. Therefore: [Assuming N(T) = N(∼T) and with slight changes to the typesetting]
P(T|P) = N(T.M) / (N(T.M) + N(∼T.M)) = qN(T) / (q × N(T) + 1 × N(∼T)) = q / (q + 1)
So this is saying that the probability a story is true given that it is embarrassing will always be less than 0.5 (since q < 1), so the EC actually works in reverse!
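Carrier's counting argument is easy to check numerically. The sketch below is my own illustration, with made-up counts and an illustrative q = 0.3:

```python
def frac_true_preserved(q, n_true, n_false):
    """Fraction of preserved embarrassing stories that are true, given
    that a fraction q of true stories is preserved (N(T.M) = q*N(T))
    while every fabricated story is preserved (N(~T.M) = N(~T))."""
    n_tm = q * n_true          # preserved true stories
    n_ntm = 1.0 * n_false      # preserved fabricated stories
    return n_tm / (n_tm + n_ntm)

# With N(T) = N(~T) the counts cancel and the result is q / (q + 1):
print(frac_true_preserved(0.3, 500, 500))  # ≈ 0.231
```

Whatever the shared count, the result stays below 0.5 for any q < 1, which is exactly the "works in reverse" conclusion.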
3.2 Reality Check
If you read a memoir and it said that (1) the author severely bullied one of his classmates for a year, and (2) the author once gave a large sum of money to a homeless man, then, all things being equal, which of the two would you be more inclined to believe the author had made up? If the memoir were a gospel, we should be more inclined to believe the story of the bullying was made up; however, this obviously goes against common sense! As Richard Carrier himself points out, sometimes the EC does work, and any computation must at the least be able to replicate this situation.
3.3 What happened
I think the first observation is that the quoted argument in Proving History does not actually use Bayes theorem (specifically, it avoids the use of probabilities), but relies on ratios of the sizes of appropriately defined sets. I can't tell why this choice is made, but it is a recurring theme throughout the book to argue for the application of Bayes theorem and then carry out at least part of the argumentation using non-standard arguments. Another thing I found confusing was how the sets are actually defined and why they are chosen the way they are. To first translate the criteria into Bayes theorem we need to define the
appropriate variables. As I understand the text, they are defined as follows:
T, F : The story is true (as opposed to fabricated)
Pres : The story was preserved
Em : The story is embarrassing
The discussion carried out in the text now amounts to the following assumptions:
P (Pres| ∼T, Em) = 1
P (Pres|T, Em) = q < 1
The first assumption says that the only way someone would fabricate a seemingly embarrassing story is if it serves some purpose, and so it must be preserved; the second says that a true story which seems embarrassing might not serve a specific purpose, and we are not guaranteed it will be preserved. It should be clear by now that we are really interested in computing P(T|Pres, Em), the probability a story is true given that it is preserved and seems embarrassing. Turning the Bayesian crank:
P(T|Pres, Em) = P(Pres|T, Em)P(T|Em) / [P(Pres|T, Em)P(T|Em) + P(Pres|∼T, Em)P(∼T|Em)]
             = qP(T|Em) / [qP(T|Em) + P(∼T|Em)]
             = q / (q + 1)
from which the result follows. We can try to translate the result into English: suppose the gospel writers started out with/made up an equal number of true and false stories that seem embarrassing today. However, all the seemingly embarrassing stories that are false were made up (by the gospel writers or whoever supplied them with their material) because they were significant, and were therefore preserved, while the true seemingly embarrassing stories were preserved/written down by the gospel writers at a low rate, q; therefore almost all seemingly embarrassing stories that survive to this date are false.
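The same q/(q + 1) conclusion can be reproduced by coding the probability version of the computation directly. In this minimal sketch the prior P(T|Em) = 0.5 and the grid of q values are my own illustrative choices:

```python
def p_true_given_pres_em(q, p_t_em=0.5):
    """P(T|Pres, Em) via Bayes theorem, under the assumptions
    P(Pres|T, Em) = q and P(Pres|~T, Em) = 1."""
    num = q * p_t_em
    den = q * p_t_em + 1.0 * (1.0 - p_t_em)
    return num / den

# For any q < 1 the posterior stays below 0.5, reproducing q / (q + 1):
for q in (0.1, 0.5, 0.9):
    print(q, p_true_given_pres_em(q))
```

With the equal-prior assumption the p_t_em terms cancel and the count-based and probability-based versions of the argument agree exactly.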
A reader might notice I have used the phrase ”seemingly embarrassing”, by
which I mean ”seemingly embarrassing to us”. This is evidently required for the
argument to work. Consider for instance the assumption P(Pres|∼T, Em) = 1. If Em instead meant that the story was truly embarrassing to the author, this would
mean that false stories made up by friendly sources (are there any?) which were truly embarrassing would always be preserved – a highly dubious assumption and clearly contrary to Dr. Carrier’s argument.
A basic problem in the above line of argument is that there is no way to encode the information that a story was actually embarrassing. We are, effectively, analysing the criteria of embarrassment without having any way to express that a story was embarrassing to the author! "Embarrassing" therefore becomes effectively synonymous with "embarrassing with a deeper literary meaning" (the reader can try substituting this phrase in the previous sections and notice the argument becomes more natural), and the
analysis boils down to saying that stories with a deeper literary meaning (that also happen to look embarrassing today) are for the most part made up, except a few that are true and happen to have a deeper meaning by accident.
3.4 Adding Embarrassment to the Criteria of Embarrassment
To call something an analysis of the criteria of embarrassment, we need enough expressiveness amongst our variables to capture the basic intuition behind the criteria. I believe the following is minimal:
T, F : The story is true or fabricated
Pres : The story was preserved
Em : The story is seemingly embarrassing (to us)
Tem : The story was truly embarrassing to the author
LP : The story served a literary purpose (we assume ∼Tem = LP)
Notice Tem means something different than Em: Tem means the story was embarrassing to the one doing the preservation, Em means it seems embarrassing to us 2000 years later. To put the EC into words: a person would not preserve something that was actually embarrassing which he knew was false, or in symbols:

P(Pres|∼T, Tem) = 0

The following is always true:

P(Pres, T, Tem|Em) = P(Pres|T, Tem, Em)P(T|Tem, Em)P(Tem|Em)

where I have been really sloppy in the notation and implicitly assume variables such as T and Tem can also take the values ∼T and ∼Tem = LP. The next step is to add simplifying assumptions. I am going to assume

P(Pres|T, Tem, Em) = P(Pres|T, Tem)

P(T|Tem, Em) = P(T|Tem)
The assumption here is that our (20th century) interpretation of whether a story is embarrassing or not is secondary to whether it was truly embarrassing. Next, let's look at the likelihood term. I will assume:
P (Pres|F, Tem) = 0
P (Pres|F, LP) = l
P (Pres|T, Tem) = c
P (Pres|T, LP) = 1
The first and last specifications say that an author would never record something truly embarrassing he knew was false, and would always record something he
knew was true and served a literary purpose. The second specification says the author will (with probability l) include stories that are false but nevertheless serve a literary purpose, and the third that he has a certain candor that makes him sometimes (with probability c) include embarrassing stories he knows are true. Turning the Bayesian crank now gives:
P(T|Pres, Em) = [P(Tem|Em)P(T|Tem)c + P(LP|Em)P(T|LP)] / [P(Tem|Em)P(T|Tem)c + P(LP|Em)P(T|LP) + P(F|LP)P(LP|Em)l]
This is a bit of a mess. Let's begin by assuming we are equally good at determining whether a story is truly embarrassing or serves a literary purpose, i.e. P(Tem|Em) = P(LP|Em) = 0.5, and that we know nothing of the (conditional) probability a story is true/false, e.g. P(T|Tem) = P(T|LP) = 0.5. In this case:
P(T|Pres, Em) = (c + 1) / (c + 1 + l)

We can now try to plug in some limits. If we assume the gospel authors have perfect candor and will always report true stories (c = 1), we get:

P(T|Pres, Em) = 2 / (2 + l) ∈ [2/3, 1]
so in this case the criteria of embarrassment actually work. Another case might be where the gospel authors have no candor and will always suppress embarrassing stories, c = 0, and in this case
P(T|Pres, Em) = 1 / (1 + l) ∈ [1/2, 1]
so the criteria of embarrassment actually also work in this limit(!). To recover Dr. Carrier's analysis, we need something more. Inspecting the full expression reveals that the easiest thing to assume is something like:
P(T|LP) = q < 1/2
which is saying that stories that serve a literary purpose are likely to be made up. I suppose the value you think q should have depends on how you view Jesus:
Do you expect him to have lived the sort of life where many of the things he did or said would have a deeper literary purpose afterwards? Your religious views may influence how you judge that question, to put it mildly. At any rate, this leads to the new expression:
P(T|Pres, Em) = (c + q) / (c + q + (1 − q)l).
It is difficult to directly relate this expression to Dr. Carrier's analysis; however, let's assume a story is preserved with probability 1 if it is false but serves a literary purpose (l = 1), and that a story which is true but also embarrassing will never be preserved (c = 0). Then we simply obtain
P(T|Pres, Em) = q < 1/2
which is qualitatively consistent with Dr. Carrier’s result.
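The extended model can likewise be evaluated numerically. The sketch below is my own implementation of the full expression above; the default value 0.5 for P(Tem|Em), P(T|Tem) and P(T|LP) follows the simplifying assumptions in the text, while the particular c, l and q values plugged in are illustrations only:

```python
def p_true_extended(c, l, p_tem_em=0.5, p_t_tem=0.5, p_t_lp=0.5):
    """P(T|Pres, Em) in the extended model: c is the author's candor
    (probability of preserving a true, truly embarrassing story), l the
    probability of preserving a false story with a literary purpose."""
    p_lp_em = 1.0 - p_tem_em
    num = p_tem_em * p_t_tem * c + p_lp_em * p_t_lp * 1.0
    den = num + (1.0 - p_t_lp) * p_lp_em * l
    return num / den

# Perfect candor (c = 1): posterior is 2/(2 + l), at least 2/3.
print(p_true_extended(c=1.0, l=1.0))   # ≈ 0.667
# No candor (c = 0): posterior is 1/(1 + l), at least 1/2.
print(p_true_extended(c=0.0, l=1.0))   # 0.5
# Carrier-like limit: c = 0, l = 1 and P(T|LP) = q recovers q.
print(p_true_extended(c=0.0, l=1.0, p_t_lp=0.3))  # ≈ 0.3
```

The last call reproduces the q < 1/2 limit above, while the first two show the regimes in which the criteria of embarrassment do work.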
3.4.1 Some thorny issues
Dr. Carrier offered one analysis of the EC which indicates that embarrassment lowers the probability a story is historical; I included a variable that actually allows for a story to be embarrassing and got the opposite result. My point is not to demonstrate that one of us is right or wrong, but to motivate some questions I think are problematic in terms of applying Bayes theorem to history:
Do we actually model history: I think both Dr. Carrier's and my analysis contained a term like P(Pres|T, x) (with x possibly meaning different things). The model this presumes is something akin to the following: the gospel authors are compiling (or preserving) a set of stories with knowledge of their truth-value and (at least in my case) knowledge of their literary purpose and their being embarrassing. However, I think it is uncontroversial to say this is a bad model of how the gospel authors worked. For instance, the gospel authors also made up a good deal of the gospels, changed stories to fit an agenda and so on, and this should also figure in the analysis.
When true is not true: Continuing the above line of thought: with some probability, which we need to estimate, the gospel authors did not know what was true or false per se, because they were writing about events that may have happened 40 years prior. This means that conditioning on a variable T (true) is problematic. "True" seems more likely to mean (with some probability at least) that the statement was something that was being told by the Christian community and believed to be true. This should be included in the analysis.
Where do the stories come from: Continuing the above line of thought, if the gospel authors had access to a set of stories about Jesus, we need to ask where they came from. This leads to a secondary application of the criteria of embarrassment, but with the subtle difference that we know even less about who the original compilers (or tellers) of these stories were, what they would find embarrassing, what they actually produced and so on; this should also be included in the analysis.
Variable sprawl: A basic point is this: if we want to determine how well the criterion of embarrassment works in a Bayesian fashion, we need to model the underlying situation with some accuracy. Continuing the above line of thought would probably result in a good 10-20 (100?) variables that mean different things and are all relevant to determining whether a seemingly embarrassing story is historical or not. Basically, every time one has a noun and a "might" or "probably", there is a new variable for the analysis, and we must include these variables in our analysis. Determining what the variables actually mean, what their probabilities are and how they (numerically) affect each other is a truly daunting task that scales exponentially in the number of variables. Is it possible to undertake this project and expect some accuracy at the end?
Toy models: An alternative view is to undertake the analysis using naive toy models and to argue why large parts of the problem can either be ignored or approximated by these toy models. This is what both Dr. Carrier and I have done. This is probably a more fruitful way to approach the problem; however, since all toy models are going to be wrong (the fact that Dr. Carrier and I produced exactly opposite results is evidence of this), this raises some basic questions about how the numerical estimates we get out are connected to the historical truth of any given proposition under consideration.
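The variable-sprawl point above can be made concrete with a back-of-the-envelope count (the variable counts are my own illustration, not from the book): a full joint distribution over n binary variables needs 2^n − 1 independent probabilities, so elicitation effort explodes long before one reaches 100 variables.

```python
# Number of independent probabilities needed to specify a full joint
# distribution over n binary variables: 2**n - 1 (the -1 because the
# entries must sum to one).
for n in (4, 10, 20, 100):
    print(f"{n:>3} binary variables -> {2**n - 1} probabilities to elicit")
# With 20 variables that is already 1,048,575 numbers; with 100, about 1.3e30.
```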
In statistical modelling, or any other science for that matter, whenever one postulates a model, no matter how reasonable the assumptions that go into it may seem, there must be a step where the result is validated in some way, for instance by predicting a feature of the data which can be checked. I hope the disagreement between Dr. Carrier's model for the criterion of embarrassment and my proposed model will convince the reader such measures are required. How such validations should be carried out is not discussed in Proving History, nor does one get the impression there would be much of a need in the first place. I will try to illustrate how Proving History treats this issue with two examples. The first is from chapter six, on resolving expert disagreement, in which it is discussed at some length how Bayes theorem can be used to make two parties agree:
The most common disagreements are disagreements as to the contents of b (background knowledge) or its analysis (the derivation of estimated frequencies). Knowledge of the validity and mechanics of Bayes's Theorem, and of all the relevant evidence and scholarship, must of course be a component of b (hence the need for meeting those conditions before proceeding). This basic process of education can involve making a Bayesian argument, allowing opponents to critique it (by giving reasons for rejecting its conclusion), then resolving that critique, then iterating that process until they have no remaining objections (at which time they will realize and understand the validity and operation of Bayes's Theorem and the soundness of its application in the present case). So, too, for any other relevant knowledge although they may also have their own information to impart to you, which might in fact change your estimates and results, but either way disagreements are thereby resolved as both parties become equally informed and negotiate a Bayesian calculation whose premises (the four basic probability numbers) neither can object to, and therefore whose conclusion both must accept.
A worrying aspect of the above quote is how Dr. Carrier discusses these problems as having to do with estimating the "four basic probability numbers", by which I assume he really intends the three numbers P(e|h, b), P(h|b), P(e|∼h, b). Just to take my toy example from above, there will very clearly be more than four numbers involved. In fact, the number of probabilities involved will grow exponentially in the number of different binary variables (such as T, Em, Tem, etc. in the above) we attempt to treat in our analysis. I think the pressing issue is not whether two perfectly rational scholars should in principle end up agreeing, but how we ourselves would know that what we were doing had scientific value, and what two scholars should do in practice.

The second suggestion in Proving History is a-fortiori reasoning. This roughly means using the largest/smallest plausible values of the probabilities in the analysis to see which kinds of results one may obtain. I think there are ample reasons to suspect, based on the past example alone, that one can get divergent results this way. At any rate, such over- or underestimation would not fix the problem of having the wrong model to begin with, a point the toy example above should be sufficient to demonstrate.
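A-fortiori reasoning can at least be made mechanical. A minimal sketch, with all numerical bounds invented purely for illustration, that propagates lower and upper plausible values of the three inputs through Bayes theorem:

```python
def posterior(prior, like_h, like_not_h):
    """P(h|e) computed from P(h), P(e|h) and P(e|~h) via Bayes theorem."""
    return prior * like_h / (prior * like_h + (1 - prior) * like_not_h)

# Invented a-fortiori bounds on each of the three inputs:
prior_lo, prior_hi = 0.2, 0.4   # bounds on P(h)
lh_lo, lh_hi = 0.5, 0.9         # bounds on P(e|h)
lnh_lo, lnh_hi = 0.1, 0.3       # bounds on P(e|~h)

# The posterior is monotone increasing in the prior and in P(e|h),
# and decreasing in P(e|~h), so the extremes bound the posterior:
post_lo = posterior(prior_lo, lh_lo, lnh_hi)
post_hi = posterior(prior_hi, lh_hi, lnh_lo)
print(round(post_lo, 3), round(post_hi, 3))  # 0.294 0.857
```

Even with fairly tight bounds on each input, the admissible posterior interval here spans roughly 0.29 to 0.86, illustrating how easily a-fortiori bounds can become too wide to settle anything.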
4 The re-interpretation of probability theory
In my reading of the book there were a number of times where I had problems following the discussion, for instance when discussing how to obtain prior probabilities from frequencies, or the suggestion of a-fortiori reasoning. I think chapter six, "The technical stuff", explains much of this confusion, namely through Dr. Carrier's suggestion for how one can combine the Bayesian and the frequentist views of probability, which is also a main theoretical contribution of the book. Before I return to some more practical considerations I wish to treat Dr. Carrier's suggestion in more detail. One of the main purposes of chapter six is to address some philosophical issues of Bayesian theory. Dr. Carrier introduces the chapter with these words:
Six issues will be taken up here: a bit more on how to resolve expert disagreements with BT; an explanation of why BT still works when hypotheses are allowed to make generic rather than exact predictions; the technical question of determining a reference class for assigning prior probabilities in BT; a discussion of the need to attenuate probability estimates to the outcome of hypothetical models (or a hypothetically infinite series of runs), rather than deriving estimates solely from actual data sets (and how we can do either); and a resolution of the epistemological debate between so-called Bayesians and frequentists, where I'll show that since all Bayesians are in fact actually frequentists, there is no reason for frequentists not to be Bayesians as well. That last may strike those familiar with that debate as rather cheeky. But I doubt you'll be so skeptical after having read what I have to say on the matter. That discussion will end with a resolution of a sixth and final issue: a demonstration of the actual relationship between physical and epistemic probabilities, showing how the latter always derive from (and approximate) the former.
The emphasis marks the claims I will focus on in this review. In reviewing Dr. Carrier's suggestion, I will not focus so much on the "debate" between frequentists and Bayesians (in my experience it is not something one encounters very frequently), but rather on Dr. Carrier's proposed interpretation of Bayesian probabilities. I apologize in advance that the section will be somewhat technical in places; I have tried to structure it by providing what I consider a "standard" Bayesian answer (these sections will be marked with an *) to the questions Dr. Carrier attempts to answer, and then discussing Dr. Carrier's alternative suggestion. But before I begin I think it is useful to review the standard Bayesian interpretation of the two central terms Dr. Carrier seeks to investigate, namely probabilities and frequencies. The following continues from the introduction to Bayes theorem I outlined in the first section. I will refer readers to E. T. Jaynes's book, which discusses these issues with much more clarity.
4.1 Probabilities and frequencies. The mainstream view*
The mainstream Bayesian view on frequencies and probabilities can be summarized as follows:

Probabilities represent degrees of plausibility. Probabilities therefore refer to a state of knowledge of a rational agent and are either assigned based on (for instance) symmetry considerations (the chance a coin comes up heads is 50% because there are two sides) or derived from other probabilities according to the rules of probability theory (hereunder Bayes theorem).

Frequencies are factual properties of the real world that we measure or estimate. For instance, if we count 10 cows in a field and notice 3 are red, the frequency of red cows is 3/10 = 0.3. This is not a probability. The two simply refer to completely different things: probabilities change when our state of knowledge changes, frequencies do not.
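The distinction can be made concrete in a small sketch. The frequency below is a fixed measured fact; the probability is a state of knowledge, here updated with Laplace's rule of succession (my own illustrative choice of prior and rule, not something from the book):

```python
from fractions import Fraction

# A frequency is a measured fact about the world:
cows, red = 10, 3
frequency = Fraction(red, cows)   # 3/10, and it stays 3/10 whatever we believe

# A probability reflects a state of knowledge and changes with that knowledge.
# Under a uniform prior on the unknown red-cow rate, Laplace's rule of
# succession gives the probability that the *next* cow we see is red:
probability = Fraction(red + 1, cows + 2)   # (3+1)/(10+2) = 1/3
print(frequency, probability)
```

Seeing more cows changes the probability; it never retroactively changes the recorded frequency 3/10.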
With these things in mind, let us focus on Dr. Carrier's definitions of probabilities and frequencies.
4.2 Richard Carrier's proposal
A key point I found confusing is what Dr. Carrier actually means by the word probability. The word is used from the beginning to the end of the book, yet an attempt to clarify its meaning is only encountered in Chapter 2, right after stating axiom 4: "Every claim has a nonzero probability of being true or false (unless its being true or false is logically impossible)" ^{4} . The following clarification is then given:

By probability here I mean epistemic probability, which is the probability that we are correct when affirming a claim is true. Setting aside for now what this means or how they're related, philosophers have recognized two different kinds of probabilities: physical and epistemic. A physical probability is the probability that an event x happened. An epistemic probability is the probability that our belief that x happened is true.

^{4} The axioms and rules are themselves somewhat [...]. If the historical method reduces to an application of Bayes theorem, shouldn't we rather be interested in the assumptions behind Bayes theorem?
Notice that the definitions of "probability", "epistemic probability" and "physical probability" themselves rely on the word "probability", which is of course circular. The definition is revisited in chapter 6 in the section "The role of hypothetical data in determining probability". The definition (it is hard to tell if an actual definition is offered) introduces the auxiliary concepts "logical truths", "empirical truths" and "hypothetical truths". I will confess I found the chapter very difficult to understand, and I will therefore provide quotations before giving my own impression of the various definitions and arguments, such that the reader can form his own opinion.
What are probabilities really probabilities of? Mathematicians and philosophers have long debated the question. Suppose we have a die with four sides (a tetrahedron), its geometry is perfect, and we toss it in a perfectly randomizing way. From the stated facts we can predict that it has a 1 in 4 chance of coming up a 4 based on the geometry of the die, the laws of physics, and the previously proven randomizing effects of the way it will be tossed (and where). This could even be demonstrated with a deductive syllogism (such that from the stated premises, the conclusion necessarily follows). Yet this is still a physical probability. So in principle we can connect logical truths with empirical truths. The difference is that empirically we don't always know what all the premises are, or when or whether they apply (e.g. no die's geometry is ever perfect; we don't know if the die-thrower may have arranged a scheme to cheat; and countless other things we might never think of). That's why we can't prove facts from the armchair.
From this, it seems the "logical truth" is the observation that a perfectly random throw of a perfect die with four sides will come up 4 exactly a quarter of the time. Dr. Carrier notes this probability is connected to the "physical probability", by which I believe is meant how a concrete die will behave. While it is clearly true that the two things must be connected in some way, the entire point must be how the two are connected. In the following section Dr. Carrier (correctly) identifies this connection as having to do with our lack of knowledge. The text then continues:
Thus we go from logical truths to empirical truths. But we have to go even further, from empirical truths to hypothetical truths. The frequency with which that four-sided die turns up a 4 can be deduced logically when the premises can all be ascertained to be true, or near enough that the deviations don't matter (yet ascertained still means empirically, which means adducing a hypothesis and testing it against the evidence, admitting all the while that no test can leave us absolutely certain). And when these premises can't be thus ascertained, all we have left is to just empirically test the die: roll it a bunch of times and see what the frequency of rolling 4 is. Yet that method is actually less accurate. We can prove mathematically that because of random fluctuations the observed frequency usually won't reflect the actual probability. For example, if we roll the die four times and it comes up 4 every time, we cannot conclude the probability that this die will roll a 4 on the next toss is 100% (or even 71%, which is roughly the probability that can be deduced if we don't assume the other facts in evidence). That's because if the probability really is 1 in 4, then there is roughly a 0.4% chance you'll see a straight run of four 4's (mathematically: 0.25 ^{4} = 0.00390625).
I believe the above discussion can be summarized as follows: suppose we have an idealized die with four sides that we roll in an idealized way. The chance it will come up 4 is (exactly) 0.25. This is what Dr. Carrier calls a hypothetical truth. However, since a real die has minute random imperfections, the real chance it will come up 4 is slightly different, perhaps 0.256. This is the physical probability. The reason these two numbers differ is that we are unaware of the small imperfections in the die. Now, if we roll an actual die a number of times, say 4, and compute the frequency, i.e. the ratio of rolls coming up 4 to the total number of rolls, we will get a third number which will probably not equal either of the above. In fact, the fluctuations being discussed are exactly distributed according to the previously introduced binomial expression, viz.:
P(n 4's in N rolls) = C(N, n) p^n (1 − p)^(N−n), where C(N, n) is the binomial coefficient and p = 0.25.
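The binomial expression can be checked numerically; a minimal sketch (the function name is mine):

```python
import math

def binom_pmf(n, N, p=0.25):
    """P(n fours in N rolls) = C(N, n) * p**n * (1 - p)**(N - n)."""
    return math.comb(N, n) * p**n * (1 - p)**(N - n)

# Chance of a straight run of four 4's with a fair four-sided die:
print(0.25 ** 4)                      # 0.00390625, i.e. about 0.4%
# Even the single most likely count (250 fours in 1000 rolls) is itself rare,
# which is why an observed frequency almost never equals 0.25 exactly:
print(binom_pmf(250, 1000))           # roughly 0.03
```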
While there are a few minor points about the way the problem is laid out (for instance the use of the word "chance" is problematic: how is it defined without reference to probabilities?) and about the terminology, the problem raised above, namely how these three numbers are related, is the central one. We will now turn to Dr. Carrier's proposal; the discussion continues as follows:
Even a thousand tosses of an absolutely perfect four-sided die will not generate a perfect count of 250 4's (except but rarely). The equivalent of absolutely perfect randomizers do exist in quantum mechanics. An experiment involving an electron apparatus could be constructed by a competent physicist that gave a perfect 1 in 4 decision every time. Yet even that would not always generate 250 hits every 1,000 runs. Random variation will frequently tilt the results slightly one way or another. Thus, you cannot derive the actual frequency from the data alone. For example, using the hypothetical electron experiment, we might get 256 hits after 1,000 runs. Yet we would be wrong if we concluded the probability of getting a hit the next time around was 0.256. That probability would still be 0.250. We could show this by running the experiment several times again. Not only would we get a different result on some of those new runs (thus proving the first result should not have been so concretely trusted), but when we combined all these data sets, odds are the result would converge even more closely on 0.250. In fact you can graph this like an approach vector over many experiments and see an inevitable curve, whose shape can be quantified by mathematical calculus, which deductively entails that that curve ends (when extended out to infinity) right at 0.250. Calculus was invented for exactly those kinds of tasks, summing up an infinite number of cases, and defining a curve that can be iterated indefinitely, so we can see where it goes without actually having to draw it (and thus we can count up infinite sums in finite time).
The last paragraph verges on gobbledygook, using technical words in a manner that is both unclear and very hard to recognize. The proposal seems to be that if we carry out the idealized experiment for a sufficiently long time, the observed frequency will converge towards 0.25. A reader who is unfamiliar with this result should keep in mind that a formal statement of it (from the setup I assume it is the weak law of large numbers Dr. Carrier has in mind) contains the somewhat technical clause "will converge with probability one", so if one is using such an argument to later define probability there is again an issue of circularity. Directly following the above paragraph is this:

Clearly, from established theory, when working with the imagined quantum tabletop experiment we should conclude the frequency of hits is 0.25, even though we will almost never have an actual data set that exhibits exactly that frequency. Hence we must conclude that that hypothetical frequency is more accurate than any actual frequency will be. After all, either the true frequency is the observed frequency or the hypothesized frequency; due to the deductive logic of random variation you know the observed frequency is almost never exactly the true frequency (the probability that it is is always ≤ 0.5, and in fact approaches 0 as the odds deviate from even and the number of runs increases); given any well-founded hypothesis you will know the probability that the hypothesized frequency is the true frequency is > 0.5 (and often ≫ 0.5, and certainly not → 0); therefore P(THE HYPOTHESIZED FREQUENCY IS THE TRUE FREQUENCY) > P(THE OBSERVED FREQUENCY IS THE TRUE FREQUENCY); in fact, quite often P(HYPOTHESIZED) ≫ P(OBSERVED). So the same is true in every case, including the four-sided die, and anything else we are measuring the frequency of. Deductive argument from empirically established premises thus produces more accurate estimates of probability.
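The convergence claim itself is easy to simulate; a sketch assuming a fair four-sided "die" (the seed and function name are mine, chosen only for reproducibility):

```python
import random

random.seed(1)  # arbitrary fixed seed so the run is reproducible

def observed_frequency(n_rolls, p=0.25):
    """Roll a fair four-sided 'die' n_rolls times; return the frequency of 4's."""
    hits = sum(1 for _ in range(n_rolls) if random.random() < p)
    return hits / n_rolls

for n in (100, 10_000, 1_000_000):
    print(n, observed_frequency(n))
# The frequency drifts toward 0.25 as n grows, yet is essentially never
# exactly 0.25 in any finite run -- which is the reviewer's point about the
# "with probability one" clause hiding in the formal statement.
```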
The main philosophical "charge" (if you will) leveled by Bayesian statisticians against frequentists is that the frequentist view tends to require thought-experiments in idealized situations that are run to infinity, and I will just note that we are now imagining a quantum tabletop experiment where we assume we know the limit frequency is 0.25 (no concrete experiment I can think of would behave like that, and no experiment can be run to the limit of infinity). The typical Bayesian objection is that while we are free to think of this idealized situation as a thought-experiment, it is quite different from, e.g., the situation where we consider the probability a corpse was stolen from a grave. Again I will refer to Jaynes's book for a deeper treatment of the problems that arise, and again only note that Carrier does not discuss them at all.

However, Dr. Carrier also introduces some novel problems in his discussion. Consider the statement: "After all, either the true frequency is the observed frequency or the hypothesized frequency". But clearly this is false. Suppose I hypothesize that the so-called hypothesized frequency of the die coming up 4 is 0.25. I then roll the die 10 times and get an observed frequency (in Bayesian terms, simply the frequency) of 3/10. However, both of these values are going to be wrong, because the microscopic imperfections in the die mean it will have a different "true frequency" (in Dr. Carrier's language) than either 0.25 or 0.3, simply because there are an infinite number of other candidate true frequencies. The statement is therefore, in any practical situation, a false dilemma. Regarding the inequalities, what would actually happen is that both sides would tend towards zero, in direct contradiction to what Dr. Carrier writes (because, again staying in the frequentist language, the true frequency is with probability 1 something else), and depending on the situation the inequality could go either way. The argument is simply false. Finally, and this is a recurrent theme, it is very hard to tell what has actually been defined. I have carefully gone through the chapter, and the above quotation is the first time the term "hypothetical frequency" is used. But what exactly does it mean?
The closest to a definition comes shortly later in chapter six: "Thus we must instead rely on hypothetical frequencies, that is, frequencies that are generated by hypothesis using the data available which data includes not just the frequency data (from which we can project an observed trend to a limit of infinite runs), but also the physical data regarding the system that's generating that frequency (like the shape and weight distribution of a die)." What I think is intended here is that the "hypothetical frequency" represents our best guess at what will happen with the die (or quantum tabletop experiment) if we roll it in the future, given our knowledge of the geometry of the die and of past rolls. In Bayesian terms, we would call this the probability. Having introduced observed and hypothetical frequencies, we can now begin to make headway towards defining probabilities; unfortunately it is done in a very indirect manner:
That hypothetical frequencies are more accurate than observed frequencies should not surprise: if we take care to manufacture a very good four-sided die and take pains to use methods of tossing it that have been proven to randomize well, we don't need to roll it even once to know that the hypothetical frequency of this die rolling 4's is as near to 0.25 as we need it to be. (Thus it's not valid to argue that because hypothetical frequencies are not actual data, and since all we have are actual data, we should only derive our frequencies from the latter.) All probability estimates (even of the very fuzzy kind historians must make, such as occasioned in chapters 3 through 5) are attempted approximations of the true frequencies (as I'll further explain in the next and last section of this chapter, starting on page 265). So that's what we're doing when we subjectively assign probabilities, attempting to predict and thus approximate the true frequencies, which we can only approximate from the finite data available because those data do not reflect the true frequency of anything (Thus we must instead rely on hypothetical frequencies, that is, frequencies that are generated by hypothesis using the data available which data includes not just the frequency data (from which we can project an observed trend to a limit of infinite runs), but also the physical data regarding the system that's generating that frequency (like the shape and weight distribution of a die)). Of course, when we have a lot of good data, the observed and hypothetical frequencies will usually be close enough as to make no difference. [my italic]
The question I started out with was this: what is a probability in Proving History? To the best of my knowledge, probability is being equated with hypothetical frequencies; however, this suggestion is definitely non-Bayesian and is plagued by all the problems Bayesians have been raising for nearly a century, starting with Dr. Carrier's main technical reference for Bayes theorem, namely Jaynes's book. The first thing to notice is that the discussion above is entirely focused on dice and quantum tabletop experiments, that is, experiments which we can easily imagine being carried out over and over again. However, these setups are very different from the ones we are really interested in, namely probabilities of historical events that perhaps only happened once. To give a concrete example of this difficulty, consider the following proposition:
A : ”I believe with probability 0.8 that the 8th digit of π is a nine”
In a Bayesian view, the term "with probability 0.8" refers to a state of knowledge of π, and thus requires no auxiliary considerations; it simply reflects me thinking the 8th digit is probably a nine while not being certain. However, in the interpretation above, when we assign a probability of 0.8 to the statement then (to quote): "what we're doing when we subjectively assign probabilities, [is] attempting to predict and thus approximate the true frequencies, which we can only approximate from the finite data available". But what is the true frequency of the 8th digit of π being a 9? Why should we think there is such a thing? How would we set out to prove it exists? What is the true value of the true frequency? The basic reason these questions are hard to answer is this: either it is or it is not a nine, and the reason I am uncertain reflects only my lack of knowledge. A Bayesian treatment gives a direct analysis of this situation; an attempt to connect it to a quantum tabletop experiment does not. The situation is analogous for history. Consider for instance the probability that Caesar crossed the Rubicon, or that a miracle was made up and attributed to a
first-century miracle worker. The notion of "true frequency" in these situations becomes very hard to define; however, if we accept that probability simply refers to our degree of belief, there is no need for such thought experiments.
5 The connection between frequencies and probabilities
The last section of Chapter six offers a main philosophical point of the book, namely a combination of the frequentist and Bayesian views of probability. This is done by re-interpreting what is meant by Bayesian probabilities. The section opens thus:
Probability is obviously a measure of frequency. If we say 20% of Americans smoke, we mean 1 in 5 Americans smoke, or in other words, if there are 300 million Americans, 60 million Americans smoke. When weathermen tell us there is a 20% chance of rain during the coming daylight hours, they mean either that it will rain over one-ﬁfth of the region for which the prediction was made (i.e., if that region contains a thousand acres, rain will fall on a total of two hundred of those acres before nightfall) or that when comparing all past days for which the same meteorological indicators were present as are present for this current day we would ﬁnd that rain occurred on one out of ﬁve of those days (i.e., if we ﬁnd one hundred such days in the record books, twenty of them were days on which it rained).
Speaking of bold assertions, consider the first line: "Probability is obviously a measure of frequency". The basic problem is this: if this is obvious, how come Bayesians have failed to see the obvious for 50 years and insisted on probability being rational degrees of belief, i.e. a state of knowledge? If it is obvious, how come the main technical reference, Jaynes's book, dedicates entire chapters to arguing against this misconception? What is of course obvious is that one can go from probabilities to frequencies, as I have already illustrated with the example of the coin, but in that case the implication goes the other way: if the probability is defined in a situation where there is a well-defined experiment, such as with a coin, one can make probabilistic predictions about its frequency using Bayesian methods. What is frustrating is that Dr. Carrier's examples illustrate this well. For instance, if I am the weatherman and I say I believe it will rain tomorrow with probability 0.2, what I mean is most definitely not what Dr. Carrier says, that "it will rain over one-fifth of the region". Think of how variable the weather is and how nonsensical that statement is if you take it at face value! In fact, I would be almost certain that it might rain over 1/10 or 1/2 or 1/3 or some other fraction of the region. What I am trying to convey is that I have a lack of knowledge of whether or not it will rain tomorrow, and my models and data (and possibly Bayes theorem) allow me to quantify this as 0.2, full stop, no further thought-experiments required!
The section continues directly:
Those are all physical probabilities. But what about epistemic probabilities? As it happens, those are physical probabilities, too. They just measure something else: the frequency with which beliefs are true. Hence all Bayesians are in fact frequentists (and as this book has suggested, all frequentists should be Bayesians). When Bayesians talk about probability as a degree of certainty that h is true, they are just talking about the frequency of a different thing than days of rain or number of smokers. They are talking about the frequency with which beliefs of a given type are true, where of a given type means backed by the kind of evidence and data that produces those kinds of prior and consequent probabilities. For example, if I say I am 95% certain h is true, I am saying that of all the things I believe that I believe on the same strength and type of evidence as I have for h, 1 in 20 of those beliefs will nevertheless still be false (Probability can be expressed in fractions or percentile notation, but either is still a ratio, and all ratios by definition entail a relation between two values, and those values must be meaningful for a probability to be meaningful. For Bayesians, those two values are beliefs that are true and all beliefs backed by a certain comparable quantity and quality of evidence, which values I'll call T and Q. T is always a subset of Q, and Bayesians are always in effect saying that when we gather together and examine every belief in Q, we'll find that n number of them are T, giving us a ratio, n_t/n_q, which is the epistemic probability that any belief selected randomly from Q will be true).

The good news about the proposal is that it is relatively clearly stated; the bad news is that it is both unnecessary and defective. That the definition is defective is probably best illustrated with a small puzzle: suppose I have a coin of which I know that if I flip it two times (independently), the chance it comes up heads both times is 1/2. What is the probability it will come up heads if I flip it once? The problem is easy to solve: P(HH) = P(H)P(H) = 1/2, and so P(H) = 1/√2. Now, the problem is that 1/√2 cannot be represented as a fraction of two integers, so when Dr. Carrier writes "Probability can be expressed in fractions or percentile notation, but either is still a ratio, and all ratios by definition entail a relation between two values, and those values must be meaningful for a probability to be meaningful", and then goes on to define the probability in terms of fractions of integers (see the quotation above), he is exactly excluding the above case. It goes without saying the coin should not and does not pose a problem from a Bayesian or frequentist perspective. There are two ways to avoid the problem. One is to say we simply don't care about the coin because it's a stupid example; in my opinion that is just admitting the proposed definition does not work. The other is to say the above discussion only applies to epistemic probabilities and the coin's probability is
something else which we have not defined. The problem is that this would create absurdities, because I could then change the type of probability from epistemic to "that something else" by considering a new system that involves the coin at some point.

I think this basic example is fatal in terms of obtaining a general and consistent theory out of Dr. Carrier's proposal, but to avoid charges of rejecting a good idea because of some mathematical trickery which can perhaps be fixed, I want to point out some other, more serious ailments of the proposal, of which the coin example is only a symptom. Let us simply try to imagine how the proposal could be implemented. Suppose I consider the statement: "I will get an email between 16.00 and 17.00 today". Let us say that after thinking about this as carefully as I can, possibly using Bayes theorem, I arrive at a probability of 0.843 of that statement being true. Now, to implement the above definition, I think very carefully about all I know and, though I cannot at the moment tell how I would arrive at this conclusion, I realize I know exactly 3 other things on "the same type and strength of evidence" as was the case for the email, giving n_q = 4. I now need to compute n_t, namely the number of these beliefs that are true. A basic problem is that I wouldn't know how to do this, because I do not know which of them are true or not, so I suppose I should imagine I have access to an oracle that knows the real truth. At any rate, even without the oracle, n_t can only take the values 0, 1, 2, 3 and 4. This gives 5 different possible epistemic probabilities, n_t/n_q = 0, 1/4, 1/2, 3/4, 1, none of which is 0.843. So does this mean I didn't really believe the statement at probability 0.843? In that case, with what probability do I believe the statement, then? Does it mean the probabilities we have available are limited by how many things we know? If taken at face value, the proposal seems entirely flawed. To counter any claim that I am quoting Dr. Carrier out of context, the proposal is summarized later in the section as follows:
So when you say you are only about 75% sure you'll win a particular hand of poker, you are saying that of all the beliefs you have that are based on the same physical probabilities available to you in this case, 1 in 4 of them will be false without your knowing it, and since this particular belief could be one of those four, you will act accordingly. So when Bayesians argue that probabilities in BT represent estimates of personal confidence and not actual frequencies, they are simply wrong. Because an estimate of personal confidence is still a frequency: the frequency with which beliefs based on that kind of evidence turn out to be true (or false). As Faris says of Jaynes (who in life was a prominent Bayesian), "Jaynes considers the frequency interpretation of probability as far too limiting. Instead, probability should be interpreted as an indication of a state of knowledge or strength of evidence or amount of information within the context of inductive reasoning." But an indication of a state of knowledge is a frequency: the frequency with which beliefs in that
state will actually be true, such that a 0.9 means 1 out of every 10 beliefs achieving that state of knowledge will actually be false (so of all the beliefs you have that are in that state, 1 in 10 are false, you just won’t know which ones). This is true all the way down the line.
If anything, I think this write-up is even more muddled. To take the poker example: I don't know that 1 in 4 of the things I believe with probability 0.75 will be false. Why should they be? It might turn out that everything I believe with probability 0.75 is true. Aside from suffering from the above flaws, it suffers from all the other flaws I previously discussed. Suppose I believe exactly 4 things with probability 0.75: that the 6th digit of π is 3, that Brazilians speak Brazilian, that there are 52 states in the USA, and that Adam and Eve really lived; these things will all turn out to be false! For that reasoner, the frequency with which beliefs based on that type of evidence turn out to be false is 1. This is no problem if we use probability to refer to a state of knowledge, as Jaynes does, but it is a problem if we want to root it in what is actually the case, as Dr. Carrier suggests. Again, there is absolutely nothing novel about the points raised here; they can all be found in Jaynes' book. One might attempt to rescue the proposal as follows. Suppose one says:
"I did not intend to say, 'a [probability of] 0.9 means 1 out of every 10 beliefs achieving that state of knowledge will actually be false'. I merely meant: the average (or expected value) of n_t/n_q is 0.9." The problem with such a definition is that it will almost inherently be circular, since the average is computed using the probability, and so cannot be used to define probabilities. A source of confusion is that we can make probabilistic statements about n_t, but doing so requires that we already have a theory of probabilities. What is that theory? If we are frequentists, we need to consider why it should apply to statements about, e.g., Jesus. If it is Bayesian, well, there is your theory, and there is no reason to force an ad-hoc layer of interpretation on top of it.
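The counting problem discussed above can be made concrete with a short sketch (the email example and the value 0.843 are from the discussion above; the code itself is merely my illustration):

```python
from fractions import Fraction

# If I hold exactly n_q = 4 beliefs on "the same type and strength of
# evidence", the proposed definition only permits epistemic probabilities
# of the form n_t / n_q for n_t = 0, ..., n_q:
n_q = 4
representable = [Fraction(n_t, n_q) for n_t in range(n_q + 1)]
print([str(f) for f in representable])       # ['0', '1/4', '1/2', '3/4', '1']

# The belief in the email example was held at probability 0.843,
# which is not among them:
print(Fraction(843, 1000) in representable)  # False
```

Nothing in the definition tells us which value n_t actually takes, and no choice of n_t yields 0.843.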
5.1 There is good weather at infinity
That the proposal is flawed simply by virtue of not allowing one to represent a probability of 0.843 if one only knows 3 other things at that confidence (and imagine how unfortunate we would be if we only knew 1 such thing), or a probability of 1/√2, makes me suspect Dr. Carrier had intended some sort of limit statement, that is, using infinities in some way.
A basic problem with using infinities is that the things we consider are not infinite. If we have two interpretations of assigning probabilities in the case of 3 coins and the existence of 1 Jesus, and the first only requires us to consider 3 coins and one Jesus while the second requires us to consider an infinite number of coins and Jesuses, I think there is ample reason to suspect the first proposal is the more fundamental, for the sole reason that there was at most one Jesus. Nevertheless, I will briefly mention 3 ways to attempt to "fix" the proposal above by appeal to infinities, and simply notice there is no need for any similar
ad-hockery in a Bayesian interpretation.

The first is to propose that we always know an infinite number of things at any given probability. I think this proposal can be rejected on the grounds that it is blatantly false.

The second proposal is somewhat related to the first: to make sense of any given probability of (say) 0.8, we immediately imagine an infinite number of coin flips with biased coins that come up heads with probability 0.8, and define probability from this. I suspect it is hard to define this in a non-circular fashion (keep in mind that "random" must be defined in this context without using probability), but a worse problem is that the chance of the event happening in the real world is irrelevant to the definition, since the limit will be entirely dominated by the infinite number of hypothetical coins. Thus, the proposal has no normative power. Finally, the proposal seems to be simply a fancy way of arriving at the number 0.8: does it effectively differ from saying that a probability of 0.8 is taking a cake and dividing it in such a way that one part is in a ratio of 0.8 to the total, and that this is the definition of probability? Put briefly, I don't see how the proposal has a normative effect on how probabilities are used.

The third proposal goes deeper into frequentist land: imagine an infinite number of worlds in which we believe things at a probability of 0.8, and define the probability as how often things believed at probability 0.8 turn out to be true in these worlds. This is basically the frequentist definition of probabilities; it contains all the illusions of circularity and fancy reasoning Bayesians usually object to, and it has led frequentists themselves to object to the idea that we can assign probability to things like Jesus rising from the dead. For instance, what does the infinite number of worlds in which the 6th digit of π is 3 look like?
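The second proposal's lack of normative power can be illustrated with a simulation sketch (my own, not from the book): the limiting frequency recovers 0.8 only because we put 0.8 into the hypothetical coins ourselves.

```python
import random

random.seed(0)  # for reproducibility

# The second proposal: to make sense of a probability of 0.8, imagine an
# infinite (here merely long) sequence of flips of coins biased to come
# up heads with probability 0.8, and define the probability as the
# limiting frequency of heads. That frequency converges to 0.8 by
# construction.
p = 0.8
n = 100_000
heads = sum(random.random() < p for _ in range(n))
print(heads / n)  # close to 0.8, because that is what we assumed
```

The real-world event never enters the computation, which is exactly the objection raised above.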
5.2 The Bayesian/frequentist divide is not only about probabilities
Finally, I am not sure how the division between frequentists and Bayesians is resolved even if the proposal works. The division involves things such as whether data is fixed and parameters are variable, or data is variable and parameters are fixed. It involves frequentists objecting to applying Bayes theorem to things like those considered by Dr. Carrier, and it involves (at least some) Bayesians rejecting frequentist methods such as confidence intervals and t-tests as blatant ad-hockery that should go the way of the Dodo. I simply do not see how adding a layer of frequencies on top of Bayesian language affects the difference of opinion on these issues.
5.3 The big picture
Why should we accept Bayes theorem and its applications to questions like the book of Matthew being written by Matthew? If we do, it must be because of a
rigorous argument. I believe that, e.g., Cox and Jaynes provide such arguments, and it seems Dr. Carrier believes so as well; recall from chapter three:
The theorem was discovered in the late eighteenth century and has since been formally proved, mathematically and logically, so we now know its conclusions are always necessarily true if its premises are true.
Though the claim is surprisingly not given a reference, Carrier himself suggests exactly Jaynes. But I think it is evident that Dr. Carrier is in opposition to most of Jaynes' philosophical points and assumptions from first to last chapter. For instance, Dr. Carrier advances several different notions of probability (probability, physical probability, epistemic probability, hypothesized probability) and of frequencies. Suppose all of these are equivalent to what Jaynes calls probability and frequency; in that case, why confuse the language and not simply talk about probability and frequency? The most logical conclusion, which I repeat is very evident from simply noting the differences of opinion I have documented above, is that Dr. Carrier is in opposition to Jaynes and, by extension, Cox and most other Bayesian thinkers of the 20th century. In that case, why should we think Bayes theorem holds? How do we set out to prove it? Simply pointing to the Kolmogorov axioms won't cut it: sure, that gives us a mathematical theory of probabilities, but why suppose it applies to historical questions any more than the theory of matrices?
The alternative is that Dr. Carrier is in agreement with, e.g., Jaynes and Cox, and I have just been too sloppy to see it. For instance, the re-interpretation of epistemic probabilities as frequencies is really just something added on top of the Bayesian framework. Well, if it is just something we add, and it has no normative effect on our calculations, I think Laplace's reply is in order:
”[No, Sire,] I had no need of that hypothesis”.
6 Priors
The problem with interpreting probabilities as frequencies is in my opinion reflected throughout the book, for instance when Dr. Carrier proposes how one should arrive at priors from frequencies. The problem can be summarized like this: suppose you want to assign a prior probability to some event E. You observe E happening n times out of N. What is the prior probability P(E)? For probability to have quantitative applicability to history, it is crucial to arrive at objective ways of specifying prior probabilities. For instance, in the example of the Criterion of Embarrassment, we must be able to estimate numbers such as P(P) (the probability a gospel is preserved) or P(Em) (the probability a story is embarrassing). Without such machinery, Bayes theorem will just be a consistency requirement without the ability to provide quantitative results.
To give a concrete example of how Dr. Carrier treats this problem, consider the following from chapter six, in the section on determining a reference class:
If our city is determined to have had that special patronage, and our data shows 67 out of 67 cities with their patronage have public libraries, then the prior probability our new city did as well will now be over 98%.
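For reference, the "over 98%" figure matches Laplace's rule of succession, which gives (s + 1)/(n + 2) after s successes in n observed trials under a uniform prior; a quick check (my own sketch, assuming this is the calculation intended):

```python
from fractions import Fraction

# Laplace's rule of succession: after s successes in n trials, the
# posterior probability of success on the next trial (uniform prior)
# is (s + 1) / (n + 2). For 67 libraries out of 67 cities:
def rule_of_succession(s, n):
    return Fraction(s + 1, n + 2)

p = rule_of_succession(67, 67)   # 68/69
print(float(p))                  # 0.9855..., i.e. "over 98%"
```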
Laplace's rule of succession is invoked here to arrive at the figure 98%, as it often is throughout the book, but without any consideration of where it comes from or whether its specific assumptions are fulfilled. In fact, one would not get the impression from reading the book that Laplace's rule is a Bayesian method at all, but I digress. Now consider the following example of a more elaborate problem on libraries in two provinces:
Figure 1: Venn diagram from Proving History

To illustrate this, the libraries scenario can be represented with this Venn diagram [see figure 1]. In this example, P(LIBRARY|RC) = 0.80, P(LIBRARY|IT) = 0.90, and P(LIBRARY|NP) = 0.20. What's unknown is P(LIBRARY|C), the frequency of libraries at the conjunction of all three sets. If we use the shortcut of assigning P(LIBRARY|C) the value of P(LIBRARY|NP) < P(LIBRARY|C) < P(LIBRARY|IT), that is, P(LIBRARY|C) can be any value from P(LIBRARY|NP) to P(LIBRARY|IT), then the first concern is how likely it is that P(LIBRARY|C) might actually be less than P(LIBRARY|NP), or more than P(LIBRARY|IT), and the second concern is whether we can instead narrow the range. Given that we know Seguntium lacked special patronage, in order for P(LIBRARY|C) < P(LIBRARY|NP), there have to be regionally pervasive differences in the means and motives of veteran settlers in Italy - enough to
make a significant difference from veteran settlers in the rest of the Roman empire. And indeed, on the other side of the equation, for P(LIBRARY|C) > P(LIBRARY|IT) these deviations would have to be remarkably extreme, not only because P(LIBRARY|IT) > P(LIBRARY|RC), but also because P(LIBRARY|RC) is already >> P(LIBRARY|NP), which to overcome requires something extremely unusual. Lacking evidence of such differences, we must assume there are none until we know otherwise, and even becoming aware of such differences, we must only allow those differences to have realistic effects (e.g., evidence of a small difference in conditions cannot normally warrant a huge difference in outcome; and if you propose something abnormal, you have to argue for it from pertinent evidence, which all constitutes attending to the contents of b and its conditional effect on probabilities in BT). However, we would have to say all the same for P(LIBRARY|C) > P(LIBRARY|NP), since we have no more evidence that P(LIBRARY|C) is anything other than exactly P(LIBRARY|NP). All we have is the fact that P(LIBRARY|IT) is higher than P(LIBRARY|RC), but that in itself does not even suggest an increase in P(LIBRARY|NP), and certainly not much of an increase. Thus P(LIBRARY|NP) < P(LIBRARY|C) < P(LIBRARY|IT) introduces far more ambiguity than the facts warrant. There is every reason to believe P(LIBRARY|C) ≈ P(LIBRARY|NP) and no reason to believe being in Italy makes that much of a difference, especially as P(LIBRARY|IT) is only slightly greater than P(LIBRARY|RC), which does suggest only a small rather than a large difference between Italy and the rest of the empire, and likewise we should expect the large disparity between P(LIBRARY|NP) and P(LIBRARY|RC) to be preserved between P(LIBRARY|C) and P(LIBRARY|IT), as the causes producing the first disparity should be similarly operating to produce the second, unless, again, we have evidence otherwise.
In short, NP appears to be far more relevant a reference class than IT in this case and should be preferred until we know otherwise. And if we also use a fortiori values (setting the probability at, say, 10-30%), we will almost certainly be right to a high degree of probability. All this constitutes a more complex application of the rule of greater knowledge. When you have competing reference classes entailing a higher and a lower prior, if you have no information indicating one prior is closer to the actual (but unknown) prior, then you must accept a margin of error encompassing both, but when you have information indicating the actual prior is most probably nearer to one than the other, you must conclude that it is (because, so far as you know, it is). In short, we can already conclude that it's so unlikely that P(LIBRARY|C) deviates by any significant amount from P(LIBRARY|NP) that we must conclude, more probably than not, P(LIBRARY|C) ≈ P(LIBRARY|NP), regardless of the
difference between P(LIBRARY|IT) and P(LIBRARY|RC). And as in this case, so in many others you'll encounter.
I will admit that after six readings I am still not quite certain what exactly is being argued above; put generously, I think it is another example of the book's sometimes less than lucid style of writing. The reason the argument is hard is that the problem is underdetermined, meaning one has to add additional assumptions to get a definite result. ^{5} What is perhaps surprising is that Bayes theorem was not invoked, by which the analysis is actually fairly straightforward. All probabilities here are conditioned on RC. By Bayes theorem:
P(L|NP, IT) = P(NP, IT|L) P(L) / P(NP, IT)
Thus the argument is actually fairly simple: since almost all provinces that are NP do not have a library, P(L|NP) = 0.2, and almost all provinces that are IT do, P(L|IT) = 0.9, it follows that, all things being equal, if a province has a library, there is less chance it is both NP and IT at the same time than otherwise. For instance, if we assume the distribution factorizes, P(NP, IT) = P(NP)P(IT) and P(NP, IT|L) = P(NP|L)P(IT|L), then applying Bayes theorem twice gives:
P(L|NP, IT) = P(NP, IT|L)/P(NP, IT) · P(L) = P(L|NP) P(L|IT)/P(L) = (0.2 × 0.9)/0.8 ≈ 0.22,
which is in agreement with the discussion above. In reality, one should of course never attempt such an argument. Clearly the relevant piece of information is the number of libraries in the provinces, in addition to a number of other things we would know, and we should not simply assume independence or use some other ad hoc handwaving argument to get a prior. It is difficult to say what one should actually do. My first advice to students would be to come up with a way to validate that whatever method they chose actually worked, but this is evidently very difficult to do for historical problems. Since any computation using Bayes theorem relies on the estimation of many such probabilities, it goes without saying that this is an important difficulty.
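The factorized calculation can be verified numerically (the three probabilities are from the book's example; the independence assumption is, as noted, purely illustrative):

```python
# Numbers from the libraries example: P(L|NP) = 0.2, P(L|IT) = 0.9,
# and P(L) = 0.8. Under the (purely illustrative) independence
# assumptions P(NP,IT) = P(NP)P(IT) and P(NP,IT|L) = P(NP|L)P(IT|L),
# two applications of Bayes theorem give
#   P(L|NP,IT) = P(L|NP) * P(L|IT) / P(L).
p_L_given_NP = 0.2
p_L_given_IT = 0.9
p_L = 0.8

p_L_given_NP_IT = p_L_given_NP * p_L_given_IT / p_L
print(round(p_L_given_NP_IT, 3))  # 0.225, i.e. ~0.22
```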
7 Conclusion
Dr. Carrier attempts to apply Bayes theorem to problems in history, specifically the existence of Jesus. I emphasize that I think this is an interesting idea, and while I am uncertain we will find Jesus (or not) at the end, I am sure Dr. Carrier can get something interesting out of the endeavour. Furthermore, the sections of the book which discuss history are both entertaining and informative. However,
^{5} If the problem were actually the way it is stated, one should apply the method of maximum entropy (again, see Jaynes).
the book should not be evaluated as a treatment of history but as advocating the use of Bayes theorem for advancing historical Jesus studies. For this to be successful, two conditions must be met:
• The main problem holding back Jesus studies is one of inference, not, for instance, forming hypotheses, discerning how a priori plausible ideas are, or psychology.

• We can easily pose the inference problems in a numerical format suited for Bayesian analysis; i.e., the problems with granularity of language, formalization of our thoughts and so on can be overcome.
Are these two conditions actually met? Unfortunately, I feel the book contains great difficulties in its treatment of its various main claims, which I have discussed in the review and will summarize here:
• The proof that historical methods reduce to the application of Bayes theorem is either false or does not demonstrate anything one would not already accept if a Bayesian view of probabilities is accepted as true.

• A thorough treatment of a historical problem will include a great many interacting variables, with little chance of checking the modelling assumptions.

• The practical assignment of probabilities (and determination of proper reference classes) remains unresolved. The main example of assigning probabilities (the libraries example discussed above) relies on a non-standard argument and cannot be said to be practical.
To convincingly make the case that Bayes theorem can advance history, one needs lots and lots of worked-out examples. Unfortunately, the book contains nearly none of these, and I would say that the only time it ventures into the historical method (the case of the criterion of embarrassment) it does so in a fashion that is both distinctly non-Bayesian and without a way to encode that something is actually embarrassing to the author. The book has even greater difficulties when it addresses foundational issues such as the proposed resolution of the frequentist and Bayesian views of probabilities. An important problem with this proposal is that, taken at face value, it cannot represent probabilities like 1/√2; however, even a flawed suggestion could be interesting reading if it provided the reader with a comprehensive and accurate account of the underlying problem and the current Bayesian resolution. Unfortunately, this is not found in the book; indeed, it is impossible to find (non-circular) definitions of the most basic concepts, such as probabilities and frequencies, within it. Instead, the book introduces a plethora of important-sounding terms (epistemic probability, hypothetical probability, true frequency, etc.) which are rapidly introduced together with elaborate thought-experiments. For the most part, these thought-experiments fail to demonstrate anything concrete and, worse, may give the
unsuspecting reader the impression that something important and widely accepted is being conveyed. This discussion could easily be extended to the many other oddities found in chapter six, and it is difficult not to get the impression that sections of the book were written in a hurry.