An example study using the Federalist Papers as a test case is that of McColly and Weier (1983), who advocate a likelihood-ratio approach to attribution based on the assumption that the frequency of occurrence of a particular marker word is distributed as a Poisson random variable. Their study was directed at the Middle English Pearl-poems, but to validate the underlying assumptions of their methodology they turned to the Federalist Papers. Six essays from the Federalist corpus were selected, two of Hamilton (numbers 60 and 66), two of Madison (numbers 14 and 45) and two of the disputed group (numbers 53 and 57). Thirteen of the thirty Mosteller and Wallace marker words were chosen, words which occurred once per 1000 in at least one essay, and the likelihood ratio test on these thirteen variables was …

The five vocabulary richness variables were:

(i) $R = 100 \log N / (1 - V_1/V)$, where N is the number of word tokens, V the number of word types and V1 the number of hapax legomena (types occurring exactly once);

(ii) V2/V, the proportion of hapax dislegomena (types occurring exactly twice);

(iii) $K = 10^4 \bigl( \sum_i i^2 V_i - N \bigr) / N^2$, where Vi is the number of types occurring exactly i times;

(iv) and (v) parameters α and θ, the standardized values of the parameters from the Sichel (1975) distribution, a theoretical model for vocabulary distributions Vi. The slope of the head of the distribution is defined by α, whilst that of the tail by θ.
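For concreteness, the three closed-form measures above can be computed directly from a word-frequency spectrum. The following is a minimal Python sketch (our own illustration, not code from the paper; the function name, the token-list input and the choice of natural logarithm are assumptions):

from collections import Counter
from math import log

def richness_measures(tokens):
    # tokens: list of word tokens for one text sample (illustrative input)
    N = len(tokens)                       # number of word tokens
    freq = Counter(tokens)                # type -> frequency
    V = len(freq)                         # vocabulary size (number of types)
    spectrum = Counter(freq.values())     # i -> Vi, the number of types occurring i times
    V1, V2 = spectrum.get(1, 0), spectrum.get(2, 0)

    R = 100 * log(N) / (1 - V1 / V)       # variable (i); breaks down if every type is a hapax
    V2_over_V = V2 / V                    # variable (ii), hapax dislegomena proportion
    K = 1e4 * (sum(i * i * Vi for i, Vi in spectrum.items()) - N) / (N * N)   # variable (iii)
    return R, V2_over_V, K

Variables (iv) and (v) have no closed form of this kind; they are obtained by fitting the Sichel (1975) model to the observed distribution, which is what the SICHEL program mentioned below does.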
As recommended by Sichel, parameters α and θ were computed from the observed vocabulary distributions for nouns only, and it was also more practicable to evaluate K from the noun distributions. These five variables, therefore, collectively cover the whole structure of a writer's vocabulary frequency distribution.

To these we now propose to add a sixth variable, the W index proposed by Brunet (1978) as a measure of vocabulary richness. The index is defined by

$$W = N^{V^{-a}}$$

where a is a constant in the range 0.165 to 0.172. Here we use a = 0.170. Brunet claims that values of W are relatively unaffected by text length and that it is author-specific.

To apply vocabulary richness techniques to the Federalist problem a sample of thirty-three papers was chosen. All twelve Disputed papers were selected, along with the three Joint papers, all five Jay papers, a random sample of eight Hamilton papers (numbers 7, 11, 21, 24, 31, 36, 70, and 72) and a random sample of five Madison papers (numbers 10, 39, 40, 42, and 44). All thirty-three texts were run on the Oxford Concordance Program, after which from each word list nouns were manually tagged, lemmatized (singular and plural forms subsumed) and extracted into separate vocabulary distributions. The thirty-three distributions obtained were arranged in identical groupings for comparability, and the SICHEL computer program (see Holmes, 1992) was then run on each resultant data set. The values of the six vocabulary richness variables were readily obtained from both the OCP and SICHEL outputs.

As a check against any dependence on text length, the six variables were correlated against text length N. The only significant correlation coefficient obtained was that between V2/V and N (0.449). Although Sichel (1986) shows stability of the proportion of hapax dislegomena within an author for 1000 < N < 400,000, we may be too near the lower end of this range with the Federalist papers as compared with the Mormon texts previously examined, where N was around 10,000. It was accordingly decided to omit variable V2/V from the analysis. One interesting result was that the correlation coefficient between the type-token ratio (V/N) and N was 0.705, confirming the fact that any stylometric study using this ratio should always work with equal-size text lengths. Table 6 lists the values of N, α, θ, K, R and W for the textual samples.

The first objective of this analysis must be to find the natural groupings, if any, amongst the textual samples. Statistical grouping techniques such as cluster analysis seek to form 'clusters' or 'groups' of individuals such that individuals within a cluster are more similar in some sense than individuals from other clusters. The result may be presented in the form of a dendrogram, a two-dimensional diagram illustrating the mergers which have been made at each stage of the procedure. Figure 1 shows the dendrogram computed by applying an average-linkage cluster analysis to the twenty-one Federalist samples obtained by excluding, temporarily, the disputed papers.
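As a rough sketch of the sixth variable and of the average-linkage clustering step, assuming SciPy and Matplotlib and an entirely made-up feature matrix (the sample names and values below are placeholders, not data from Table 6):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

def brunet_W(N, V, a=0.170):
    # Brunet's W = N ** (V ** -a), with the constant a taken as 0.170
    return N ** (V ** -a)

# Placeholder matrix: one row per text sample, one column per retained
# vocabulary richness variable (alpha, theta, K, R, W); real values would
# come from the OCP and SICHEL outputs described above.
labels = ["Hamilton7", "Hamilton11", "Madison10", "Madison39", "Jay2"]
X = np.array([
    [0.42, 0.91, 105.0, 1450.0, 11.9],
    [0.40, 0.93, 110.0, 1425.0, 11.8],
    [0.55, 0.88,  95.0, 1500.0, 12.1],
    [0.53, 0.87,  97.0, 1490.0, 12.0],
    [0.60, 0.86,  90.0, 1520.0, 12.2],
])

# Standardize each variable so that no single scale dominates the distances,
# then merge clusters by average linkage and draw the dendrogram.
Z = linkage((X - X.mean(axis=0)) / X.std(axis=0), method="average")
dendrogram(Z, labels=labels)
plt.show()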
Table 6 Values of N, α, θ, K, R and W for the textual samples.
Fig. 1 Dendrogram from average-linkage cluster analysis of the twenty-one Federalist samples (Disputed papers excluded).
Fig. 2 Dendrogram including Disputed papers.
Fig. 3 PCA plot showing the thirteen papers by Hamilton and Madison.
Fig. 4 PCA plot including the Jay and Joint papers.
Fig. 5 PCA plot of the Hamilton, Madison, and Disputed papers.
The five-variable vocabulary … seem to be a good set of discriminators in a collective sense.

3. Word Frequency Analysis

In a ground-breaking study which we believe sets new directions for stylometry, Burrows (1992) employs what is essentially principal components analysis on the frequencies of occurrence of sets (typically 50) of common high-frequency words. His study was applied to a wide variety of authors and genres and achieved remarkable results as regards the clustering patterns of textual samples in the space of the first two principal components. We used this technique on the Federalist Papers, using the same samples except for the random omission of three Hamilton papers (numbers 21, 31, and 36) to reduce the burden of computation and to balance the genuine Hamilton and Madison papers at five each.

Two separate investigations were conducted. The first used a set of forty-nine words from among the list of ninety high- and medium-frequency words used as an initial pool by Mosteller and Wallace. Many of these ninety words appear only rarely in some of the papers, hence the reduction to the forty-nine most commonly occurring high-frequency function words. This procedure produces a list almost the same as would be obtained by Burrows' method of choosing the fifty most common words. Table 7 shows these forty-nine high-frequency words. A large spreadsheet was compiled of the (49 × 30) word occurrences and these were standardized for all the thirty papers by computing occurrence rates per 1000 words.

The second investigation proceeded in a similar manner but exclusively used only the thirty Mosteller and Wallace marker words listed in Table 8.

Table 7 Forty-nine high-frequency words

a      by     is     one    this
all    can    it     only   to
an     every  its    or     was
and    for    may    so     were
are    from   more   some   which
as     has    must   such   who
at     have   no     than   will
be     if     not    that   with
been   in     of     the    would
but    into   on     their

Table 8 The thirty Mosteller and Wallace marker words

according     considerable(ly)   probability
also          direction          there
although      enough             this
always        innovation         though
an            kind               to
apt           language           upon
both          matter(s)          vigorous
by            of                 while
commonly      on                 whilst
consequently  particularly       work(s)

Figure 6 shows the Hamilton and Madison papers in the space of the first two principal components for the forty-nine-word data set. This plot accounts for 74.6% of the variation in the original forty-nine-dimensional matrix and the first principal component contrasts the use of 'the' against 'to', 'would' and 'a'. The Madison papers cluster well but are joined by Hamilton 70. Use of the thirty marker-word data set in the space of the first two principal components produces the scatter plot shown in Figure 7.

If we now incorporate the Jay papers and the Joint papers, we obtain the principal components plot shown in Figure 8 for the forty-nine-word set, which accounts for 69.2% of the total variation. This is a remarkable plot, with clear clusterings evident for the Madison, Joint and Jay papers. The first principal component contrasts usage of 'the' against 'and' and 'would', whilst the second principal component contrasts 'and' against 'of' and 'a'.
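To make the procedure concrete, here is a minimal Python sketch of converting raw counts to rates per 1000 words and projecting the papers onto the first two principal components. The counts, paper lengths and the six-word subset are invented for illustration, and the use of scikit-learn is our assumption, not the authors' software:

import numpy as np
from sklearn.decomposition import PCA

# counts[i, j] = occurrences of word j in paper i; lengths[i] = total words
# in paper i.  Real values would come from the concordance output above.
words = ["the", "to", "would", "a", "and", "of"]        # a subset of the forty-nine
counts = np.array([
    [190, 70, 12, 55, 40, 120],     # e.g. a Hamilton paper
    [170, 60, 15, 50, 42, 115],
    [210, 40,  4, 35, 60, 130],     # e.g. a Madison paper
    [205, 45,  5, 38, 58, 128],
], dtype=float)
lengths = np.array([2100.0, 1900.0, 2300.0, 2250.0])

rates = 1000.0 * counts / lengths[:, None]   # occurrence rates per 1000 words

pca = PCA(n_components=2)
scores = pca.fit_transform(rates)            # coordinates of each paper on PC1 and PC2
print(scores)
print(pca.explained_variance_ratio_)         # share of the variation captured by PC1, PC2
print(dict(zip(words, pca.components_[0])))  # PC1 loadings: which words it contrasts

The signs and magnitudes of the loadings on the first component are what the text reads as, for example, a contrast between 'the' and the group 'to', 'would' and 'a'.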
Fig. 6 Hamilton and Madison papers in the space of the first two principal components, forty-nine-word set.
Fig. 7 Hamilton and Madison papers in the space of the first two principal components, thirty marker-word set.
Fig. 8 Hamilton, Madison, Jay, and Joint papers with forty-nine words.
This impressive result adds weight to Burrows' contention that multivariate word frequency analysis of large sets of common words is a stylometric technique which can discriminate between writers, even in such notoriously difficult cases as the Federalist Papers. Somewhat surprisingly, if we substitute the thirty marker words for the forty-nine-word set we obtain a plot in which the clusterings are less well defined.
Fig. 9 Hamilton, Madison, and Disputed papers in the space of the first two principal components, forty-nine-word set.
The third set of results concerns the subset of the Hamilton, Madison and Disputed papers. Figure 9 shows the principal component plot for the forty-nine-word set, which now only accounts for 53.7% of the total variation. Even so, the Disputed papers fall firmly in the Madison part of the plot along with Hamilton 70, a result in agreement with the conclusion of Mosteller and Wallace. The smaller data set of the thirty marker words produces a similar but less clear pattern, even though the first two components explain 77.1% of the variation. Once again we obtain somewhat tighter clustering with the larger set of words, even though they are not specifically chosen for their discriminatory power.

Accordingly, our third re-analysis of the Federalist Papers uses a machine-learning package but, in keeping with our intention of exploring some relatively unconventional techniques in stylometry, it does not involve a neural network. Instead we use a rule-finder package based on a 'genetic algorithm', the PC/BEAGLE system (Forsyth, 1987), to seek relational expressions characterizing the authorial styles of Hamilton and Madison.
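The internals of PC/BEAGLE are not described in this extract, so the following toy Python sketch only illustrates the general idea of a genetic rule-finder: candidate relational expressions over word rates are mutated and selected according to how well they separate the two authors. The rule form, names and parameters below are our assumptions, not the PC/BEAGLE algorithm:

import random

# Toy genetic rule-finder (illustrative only).  A candidate pro-Hamilton rule
# has the form  rate[w1] - rate[w2] <= threshold  and is scored by how cleanly
# it separates Hamilton papers from Madison papers.

def random_rule(words):
    w1, w2 = random.sample(words, 2)
    return (w1, w2, random.uniform(-30.0, 30.0))

def fires(rule, rates):
    w1, w2, threshold = rule
    return rates[w1] - rates[w2] <= threshold

def fitness(rule, hamilton, madison):
    # a good pro-Hamilton rule fires on Hamilton samples and not on Madison samples
    correct = sum(fires(rule, r) for r in hamilton) + \
              sum(not fires(rule, r) for r in madison)
    return correct / (len(hamilton) + len(madison))

def evolve(words, hamilton, madison, generations=200, pop_size=50):
    population = [random_rule(words) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda r: fitness(r, hamilton, madison), reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        for w1, w2, threshold in survivors:
            if random.random() < 0.2:               # occasionally swap the two words
                w1, w2 = random.sample(words, 2)
            children.append((w1, w2, threshold + random.gauss(0.0, 2.0)))
        population = survivors + children
    return max(population, key=lambda r: fitness(r, hamilton, madison))

Here hamilton and madison would be lists of dictionaries mapping each function word to its occurrence rate in one paper, presumably in the same per-1000-word units used earlier; the rules quoted below (in terms of ON, OF, BY, TO, IN, AND and so on) are expressions of exactly this kind.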
… falling into the relevant rows of the calibration table. It will be seen that all twelve papers have odds factors greater than 1, meaning that the posterior odds in favour of Madison as opposed to Hamilton are greater than the odds before this evidence was obtained. For seven of these disputed papers, the majority, the MOF is between 35.6571 and 86.1714. If we accept, for the sake of argument, the composite odds factors (which, if biased at all, are likely to be biased towards understatement in whichever direction they point) then the results range from odds factors of 3.57 to 60.91 in favour of Madison.

In the analysis by Mosteller and Wallace which this study most closely resembles, their 'Robust Bayesian Analysis', the odds factor of the most clearly Madisonian paper, number 54, was 169,396, a far more conclusive attribution than ours. However the lowest …

Pro-Hamilton Rules:
Rule 1: ((ON <= 6.9517) & ((56.5475 - OF) < 2.0729))
Rule 2: ((BY - TO) <= -25.7484)

Pro-Madison Rules:
Rule 1: (ON >= 6.9202)
Rule 2: !(((IN - AND) >= -6.6181) | ((THE7B < WOULD) = WAS))

Applied to the Disputed papers, the joint outcomes from both rule sets were as shown in Tables 18 and 19. Here the majority of assignments are much less confident than in Section 4.3; and we also have two definite mistakes, on paper number 55 (which Mosteller and Wallace found the least Madisonian of the twelve) and number 63 (the last that Madison contributed to the series). It would seem that HERB did benefit in the first experiment from having specially selected marker …
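As a worked illustration of the arithmetic involved (our own example; the paper reports only the odds factors themselves), the posterior odds are simply the prior odds multiplied by the odds factor:

$$\text{posterior odds} = \text{odds factor} \times \text{prior odds}.$$

Starting from even prior odds of 1:1, an odds factor of 60.91 therefore gives posterior odds of about 61:1 in favour of Madison, i.e. a probability of roughly 60.91/61.91 ≈ 0.98, whereas the weakest factor of 3.57 corresponds to a probability of only 3.57/4.57 ≈ 0.78.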
Table 18 Results of both rule-sets with forty-nine function words
(disputed papers cross-tabulated by Pro-Hamilton grade, rows H+/H?/H-, and Pro-Madison grade, columns M-/M?/M+)

H+:  55
H?:  63;  52, 53
H-:  49, 50, 51, 54, 57, 58, 62;  56

Table 20 Pro-Madison rule evaluated with mean rates

Hamilton mean rates: 3.35 - 0.47 < 0     => False
Madison mean rates:  0.14 - 1.08 < 0.48  => True
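Table 20 can be read as substituting each author's mean rates into the rule's relational expression and checking the inequality. The short Python sketch below reproduces that arithmetic; the rule's apparent form and the placeholder keys A, B and C are our inference from the table, since the underlying words are not identified in this extract:

# Evaluate a relational rule of the apparent form  rate[A] - rate[B] < rate[C]
# against each author's mean occurrence rates (per 1000 words), as in Table 20.
def rule_true(rates):
    return rates["A"] - rates["B"] < rates["C"]

hamilton_means = {"A": 3.35, "B": 0.47, "C": 0.00}
madison_means  = {"A": 0.14, "B": 1.08, "C": 0.48}

print(rule_true(hamilton_means))   # False: 3.35 - 0.47 = 2.88, which is not < 0
print(rule_true(madison_means))    # True:  0.14 - 1.08 = -0.94, which is < 0.48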