The actual application of assigning manuscripts to reviewers involves two further considerations in addition to the matching of manuscripts and reviewers. First, one needs to guard against conflicts of interest by not assigning any reviewers their own papers or those of close colleagues. Second, there is a need to balance the review assignments to ensure that no single reviewer is overworked just because that person happens to be an appropriate choice for many papers, and that each paper gets assigned a certain minimum number of reviewers. All these constraints can be expressed as linear inequalities that can be handled by a linear programming package, so the entire review assignment can in principle be done automatically. For the Hypertext’91 conference, however, these two final steps were done manually.
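To make the linear-programming formulation concrete, here is a minimal Python sketch, assuming a precomputed reviewer-by-paper similarity matrix is already available. The function name, the default load limits, and the use of scipy.optimize.linprog are illustrative assumptions, not the package the authors used.

import numpy as np
from scipy.optimize import linprog

def balanced_assignment(sim, reviews_per_paper=3, max_load=10, conflicts=()):
    """Balance review assignments as a linear program (sketch).

    sim[i, j]  -- similarity of reviewer i to paper j (higher is better)
    conflicts  -- (reviewer, paper) pairs that must never be assigned,
                  e.g. a reviewer's own paper or a close colleague's
    """
    n_rev, n_pap = sim.shape
    n_vars = n_rev * n_pap
    var = lambda i, j: i * n_pap + j      # flatten (reviewer, paper) -> index

    c = -np.asarray(sim, dtype=float).ravel()   # linprog minimizes, so negate

    A_ub, b_ub = [], []
    # Each paper gets at least `reviews_per_paper` reviewers:
    #   -sum_i x[i, j] <= -reviews_per_paper
    for j in range(n_pap):
        row = np.zeros(n_vars)
        row[[var(i, j) for i in range(n_rev)]] = -1.0
        A_ub.append(row)
        b_ub.append(-reviews_per_paper)
    # No reviewer gets more than `max_load` papers: sum_j x[i, j] <= max_load
    for i in range(n_rev):
        row = np.zeros(n_vars)
        row[[var(i, j) for j in range(n_pap)]] = 1.0
        A_ub.append(row)
        b_ub.append(max_load)

    # Conflicts of interest are pinned to zero through the variable bounds.
    bounds = [(0.0, 1.0)] * n_vars
    for i, j in conflicts:
        bounds[var(i, j)] = (0.0, 0.0)

    # The bipartite degree-constraint structure keeps the LP optimum integral,
    # so rounding the relaxed solution is safe here.
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=bounds, method="highs")
    return res.x.reshape(n_rev, n_pap).round().astype(int)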
Automated matching requires four elements, each of which is discussed further below:

1. A representation of the domain is required by some methods, including ours. We used a variety of written materials about hypertext that were available in machine-readable form.*

2. A representation of the submitted papers. Here, we used their titles and abstracts.

3. A representation of the reviewers’ expertise. We used abstracts supplied by the reviewers. Mostly, these were abstracts of papers written by the reviewers for previous conferences or for journals, but some reviewers also sent other abstracts describing their interests.

4. A concrete algorithm to compute the similarity between submissions and reviewers on the basis of the above three representations. We used Latent Semantic Indexing (LSI) [Deerwester et al. 1990], but other information retrieval methods such as vector or probabilistic retrieval could also have been used.

* The OCR files contained errors even after they had been cleaned up by humans in several independent passes. For future use of automated review assignment, we recommend having the authors provide machine-readable files directly by email or diskette if at all possible.
The LSI Method

Latent Semantic Indexing (LSI) is an extension of the vector retrieval method (e.g., [Salton and McGill 1983]) in which the associations between terms are explicitly taken into account. This is done by modeling the association of terms and documents. LSI assumes that there is some underlying or “latent” structure in the pattern of word usage that is partially obscured by variability in word choice.

The Latent Semantic Indexing (LSI) analysis described by Deerwester et al. [1990] uses singular-value decomposition (SVD), a technique closely related to eigenvector decomposition and factor analysis [Cullum and Willoughby 1985]. The term-by-document matrix is decomposed into a set of k, typically 100 to 300, orthogonal factors from which the original matrix can be approximated by linear combination. Instead of representing documents and queries directly as sets of independent words, LSI represents them as continuous values on each of the k orthogonal indexing dimensions. Since the number of factors or dimensions is much smaller than the number of unique terms, words will not be independent. For example, if two terms are used in similar contexts (documents), they will have similar vectors in the LSI representation. Similarly, a word can be near a document that does not contain it if that is consistent with the overall pattern of word usage in the collection. The LSI technique captures deeper structure than simple term-term correlations or clustering and is completely automatic.

One can interpret the analysis performed by SVD geometrically. The result of the SVD is a k-dimensional vector space in which each term and document is located. These k-dimensional term and document vectors are simply the left and right singular vectors of the SVD analysis. In this space the cosine or dot product between vectors corresponds to their estimated similarity. Since both term and object vectors are represented in the LSI space, similarities between any combination of terms and objects can be easily obtained. Retrieval proceeds by using the terms in a query to identify a point in the space, and all documents are then ranked by their similarity to this query vector.

The LSI method has been applied to several standard information retrieval collections with favorable results. The LSI method has equaled or outperformed standard vector methods and other variants in every case, and was as much as 30% better in some cases. As with standard vector methods, differential term weighting and relevance feedback can improve LSI performance substantially [Dumais 1991]. For the experiments reported in this paper, we used an entropy term weighting which is known to produce good retrieval performance [Salton and McGill 1983].
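As a concrete illustration of the decomposition and retrieval steps just described, the following Python sketch builds a k-dimensional LSI space from a term-by-document matrix and ranks documents against a query by cosine similarity. The log-entropy weighting shown is one common formulation; the paper does not spell out its exact weighting formula, so treat it and the function names as assumptions.

import numpy as np

def lsi_build(tdm, k=100):
    """Build a k-dimensional LSI space from a (terms x documents) matrix."""
    tdm = np.asarray(tdm, dtype=float)
    # Log-entropy weighting (one common formulation): local log(tf + 1)
    # times a global weight 1 + sum_j p_ij * log(p_ij) / log(n_docs).
    gf = np.maximum(tdm.sum(axis=1, keepdims=True), 1e-12)  # global term freq
    p = tdm / gf
    logp = np.log(p, out=np.zeros_like(p), where=p > 0)
    global_w = 1.0 + (p * logp).sum(axis=1) / np.log(tdm.shape[1])
    weighted = np.log1p(tdm) * global_w[:, None]

    # Keep only the k largest singular values and their singular vectors.
    U, s, Vt = np.linalg.svd(weighted, full_matrices=False)
    U, s = U[:, :k], s[:k]
    doc_vecs = Vt[:k, :].T * s          # documents as points: rows of V * S
    return U, s, doc_vecs

def rank_documents(query_term_vec, U, doc_vecs):
    """Fold a query into the space and rank all documents by cosine.

    query_term_vec should be weighted the same way as the matrix rows.
    """
    q = query_term_vec @ U              # coordinates comparable to V * S rows
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1)
                           * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)            # best-matching documents first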
Assessing the Review Assignments
In order to assess the goodness of fit between the automated assignments and the reviewers’ true interests, we asked the members of the Hypertext’91 papers committee to estimate the degree to which they would be good reviewers for each of the 117 submitted papers. All 25
Table 1 Summary of the datasets used to build the domain representation of the hypertext field.
analysis. These articles contain 4,511 hypertext nodes in the form of frames. Another analysis treated the 4,511 hypertext frames as “documents.” Note that the size (Kbytes) of the ACM dataset is slightly larger when it is split up by hypertext nodes. This is because we repeated the title of the parent article in each of its nodes. This repetition ensures that nodes from the same article will have some term overlap and thus a somewhat higher similarity. In principle, we could have taken advantage of the hypertext links in a similar manner to increase the surface-level similarity for connected nodes in different articles.

Perlman’s HCI bibliography [Perlman 1991] is a more general bibliographic database of articles about Human–Computer Interaction, many of which are directly about or relevant to the field of hypertext. This database contains the titles, authors and usually the full text of the abstract for papers in relevant journals, conferences, and edited books. For many entries, additional fields are used to indicate journal or conference name, publisher, year, etc. On Nov. 26, 1991 this set contained 1,725 entries.

A final analysis used the union of all pre-existing datasets (the three books, ACM Frames, and Perlman’s bibliography).

The size and coverage of these databases varies considerably, from the .14 Mbyte collection of 117 submitted abstracts to the 7.46 Mbyte collection of 8,629 documents which represents the union of all pre-existing datasets. For our experiments, each database was processed in a completely automatic fashion. A term-by-document matrix was generated for each collection. Cell entries represent the frequency with which a given term occurs in a given document multiplied by the entropy weight for that term. An SVD analysis was performed on this matrix and the k largest singular values and their corresponding term and document vectors were used for retrieval. The number of dimensions, k, varied from 50 to 500 in the experiments reported here.
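For illustration, the completely automatic matrix construction just described might look like the sketch below; the tokenization regular expression and the minimum-count cutoff are assumptions, since the paper does not describe its tokenizer. Its output can feed the lsi_build sketch given earlier.

import re
from collections import Counter
import numpy as np

def term_document_matrix(docs, min_count=2):
    """Turn raw document strings into a (terms x documents) count matrix."""
    token_lists = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    counts = Counter(t for toks in token_lists for t in toks)
    vocab = sorted(t for t, c in counts.items() if c >= min_count)
    index = {t: i for i, t in enumerate(vocab)}
    tdm = np.zeros((len(vocab), len(docs)))
    for j, toks in enumerate(token_lists):
        for t in toks:
            if t in index:
                tdm[index[t], j] += 1   # raw frequency; weighting comes later
    return tdm, vocab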
Table 2 Performance of the automated review assignment with various domain representations and dimensions. All results are based on matching the reviewers’ abstracts with the submitted abstracts and then normalizing the results for those reviewers who supplied multiple abstracts.
Results

Table 2 summarizes the results of some of our experiments. For each dataset, we present results for the four different measures of performance described earlier, using two different LSI spaces: one with 100 dimensions (which previous research indicates is a reasonable choice) and one with a smaller or larger value depending on the dataset. As in previous experiments [Dumais 1991], we found a large initial increase in performance as the number of dimensions increased, a peak at some intermediate value, and then a gradual decline to a level that corresponds to a standard vector retrieval method.
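For concreteness, the sketch below computes the kinds of measures quoted in the following paragraph for a single ranked list: the number of relevant articles in the top 10, the mean rating of the top 10, and a three-point precision. The paper does not define three-point precision explicitly; averaging precision at 25%, 50%, and 75% recall is a conventional reading and should be treated as an assumption.

import numpy as np

def ranked_list_measures(ratings_in_rank_order, good=6):
    """Measures for one reviewer's ranked list of papers.

    ratings_in_rank_order -- the reviewer's 0-9 relevance ratings of the
    papers, ordered by the similarity ranking under evaluation.
    """
    r = np.asarray(ratings_in_rank_order, dtype=float)
    relevant = r >= good                       # "good match" threshold
    top10_relevant = int(relevant[:10].sum())
    top10_mean_rating = float(r[:10].mean())

    total = relevant.sum()
    if total == 0:
        return top10_relevant, top10_mean_rating, 0.0
    hits = np.cumsum(relevant)
    recall = hits / total
    precision = hits / np.arange(1, len(r) + 1)
    # Three-point precision: precision averaged at 25%, 50%, 75% recall.
    three_pt = float(np.mean([precision[np.argmax(recall >= t)]
                              for t in (0.25, 0.50, 0.75)]))
    return top10_relevant, top10_mean_rating, three_pt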
In general, performance is well above chance; the results are consistent for the four different measures of performance; and there is somewhat less difference in performance across the domain representations than we expected. Averaged over all datasets and dimensions (some of which we know to be sub-optimal), the mean number of relevant articles in the top 10 articles returned by our automated method is 5.9. So, more than half of the articles are rated as at least a “good match” given our rating scale. This represents an advantage of 48% over what would have been obtained with random assignment of articles to reviewers (an average of 4.0 relevant in the top 10). The average three-point precision value is .51. This represents a 24% advantage over random assignment (for which precision was estimated using 100 different random orderings of articles). The average rated relevance of the top 10 returned articles is 5.75, where a rating of 6 or above indicates a “good match.” This represents a 22% advantage
Table 3 Performance of the automated review assignment method when “All Datasets” are used for the domain representations. The first row is based on matching the reviewers’ abstracts with the submitted abstracts and then normalizing the results for those reviewers who supplied multiple abstracts. The second row is based on matching the reviewers’ family names with the submitted abstracts.
are represented by a single point (as the combination of their abstracts) or as several separate points (as the individual abstracts). Sometimes one method is slightly better, and sometimes the other method is slightly better. Averaged over the measures of performance shown in Table 2, the split-norm method is 0.6% better than the combined method.

Not normalizing the ratings before combining the values for each reviewer’s abstracts generates worse results in practically all the cases we have looked at. Averaged over the measures of performance shown in Table 2, the split-norm method is 3.5% better than the split-nonorm method. The data in Table 2 represent the use of the split-norm method, which we chose due to its overall advantage over the combined and split-nonorm methods.

We believe that the advantage of normalizing arises because it, in effect, gives equal weight to all reviewers’ abstracts. In some cases, the lengths of the reviewers’ abstracts varied quite a bit. If the cosines are not normalized before combining, shorter abstracts will tend to have higher similarities with target documents: if two abstracts of different lengths have the same overlap (dot product) with a document, the shorter abstract will have a higher cosine with the document. Thus, in cases where the reviewers’ abstracts vary in length, shorter abstracts can dominate the combined ranking for partly artifactual reasons. By normalizing the cosines for each of the reviewers’ abstracts before combining, these effects are minimized and the distance in the LSI space between the query and submitted abstracts determines the combined ranking.
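A sketch of the split-norm combination follows. The paper says each abstract’s cosine scores are normalized before being combined per reviewer, but it does not give the formula, so the unit-length row normalization and the max-style combination below are explicitly hypothetical choices.

import numpy as np

def split_norm(abstract_scores, owner, n_reviewers):
    """Combine per-abstract cosine scores into per-reviewer scores.

    abstract_scores -- (n_abstracts x n_papers) cosines from the LSI space
    owner[a]        -- index of the reviewer who supplied abstract a
    """
    # Normalize each abstract's score vector so every abstract contributes
    # on an equal footing (unit-length rows; a hypothetical choice).
    rows = abstract_scores / (np.linalg.norm(abstract_scores, axis=1,
                                             keepdims=True) + 1e-12)
    combined = np.zeros((n_reviewers, abstract_scores.shape[1]))
    for a, r in enumerate(owner):
        # Keep the best-matching abstract per paper; averaging the
        # normalized rows would be an equally plausible combination rule.
        combined[r] = np.maximum(combined[r], rows[a])
    return combined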
Although the performance differences between the split and combined representations are not large in the present experiment, we have reason to believe that they would be larger and important in other contexts. Salton and Buckley [1990] and Kane-Esrig et al. [1989] have suggested that retrieval performance can be improved when heterogeneous texts are split into smaller units. The advantage of such an approach will be greatest when the original text is quite heterogeneous (i.e., when the inter-item similarity of the smaller units of the split representation is low). In our application, the reviewers’ multiple abstracts were often quite similar to each other. When the split representations are not very different from each other or from the combined representation, small performance differences are to be expected. In other applications such as information filtering, where a single person may have several quite disparate interests, we believe that the advantages of the split methods would be larger.

For practical applications, it may sometimes be difficult to gather machine-readable abstracts from all the members of a review committee. Therefore, we have also considered a minimalist method of representing reviewers by their family names and nothing else. This simple method is feasible when the reviewers’ names occur as terms in the dataset used for the domain representation. This will often be the case for research conferences, where reviewers are mostly people who have published extensively and are cited frequently by other authors.

Representing reviewers by their name can be done for 24 of the 25 reviewers (14 of the 15 for which full relevance ratings were collected). Unfortunately, the name of one of the reviewers did not occur as a term in the pre-existing datasets. There are many ways of handling this problem. We chose a simple and general method: we added some text containing this reviewer’s name to the database. For our particular application, we added the reviewer’s self-descriptive abstract to the database. We then used the location of the abstract to determine where the new term (the reviewer’s name in this case) is located. Since terms in the LSI space are at the centroid of the documents which contain them, this is roughly the same location that would have been obtained if the database contained this abstract.

Table 3 shows the results of representing reviewers with their names only, when All Datasets are used to form the domain representation. The results are somewhat worse than most of the analyses using the reviewers’ full abstracts. The use of full abstracts is 4.3% better on average than using names only (averaged over the performance measures shown in Table 3). Review assignments using the reviewers’ names as the only representation of their interests are thus a feasible fall-back method if abstracts are unavailable for some reviewers, but this is not recommended as a general approach.
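The fold-in described two paragraphs above can be written in a few lines against the lsi_build sketch from earlier: a new term (here, a reviewer’s family name) is placed at the frequency-weighted centroid of the documents containing it, rescaled by the singular values. The helper below is hypothetical, not the authors’ code.

import numpy as np

def fold_in_term(term_doc_row, s, doc_vecs):
    """Place a new term into an existing LSI space without redoing the SVD.

    term_doc_row -- the term's (weighted) frequencies across all documents
    s, doc_vecs  -- singular values and document vectors (rows of V * S)
                    as returned by the lsi_build sketch above
    Returns coordinates comparable to the existing term vectors (rows of U):
    a frequency-weighted centroid of the containing documents, rescaled.
    """
    return (term_doc_row @ doc_vecs) / (s ** 2)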
[Figure 1: mean relevance rating as a function of the number of papers per reviewer (0–40).]
Figure 1 Curves showing how the mean relevance ratings of the assigned papers drop as more papers are assigned to each reviewer when using different datasets for the domain representation. All curves (except the top and bottom curves) have been calculated with the number of LSI dimensions equal to 100. The top curve shows the theoretically optimum relevance rating that would be achieved by perfectly matching the reviewers’ own ratings. The n of 2n curve (stippled gray) shows the result of sending the reviewers twice as many papers as they are asked to review and having them pick half of their review assignment themselves from among these papers. The black dots indicate the performance of three human experts who simulated the program chair’s job.
One obvious problem with representing the reviewers by their family name only is that several different individuals may have the same name even though they have widely differing research interests and review expertise. The Hypertext’91 reviewers mostly had fairly rare family names that to our knowledge are not shared by other active researchers in the hypertext field. It is likely, however, that some name overlap would occur if the same method were to be used for broader fields.

One can also represent reviewers by the keywords given for their papers rather than the full text of their abstracts. Doing so gives slightly better performance than using the reviewers’ names only, but using the full abstracts is still 4.1% better than using the keywords.

Unless otherwise noted, all results reported in this paper represent the reviewers by several independent abstracts for which the cosine similarity measures have been normalized before combining (the split-norm method).

Varying the Number of Papers Assigned to Each Reviewer

Figure 1 compares the mean relevance rating for various assignment methods as more and more papers are assigned to each reviewer. The figure has curves for three of the automatic LSI methods listed in Table 2 (All Datasets, the 117 submitted abstracts, and the Shneiderman & Kearsley book). Figure 1 also contains curves for the n of 2n method discussed further below, and for perfect matches and random assignment.
Table 4 Comparison of the performance of human experts with two of the automated methods.
[Table 4 row: Mean rating of papers assigned by the LSI method (All Datasets, n=100): 5.33, 5.33, 5.31.]
achieves the same performance as our Expert 1. As indicated by Table 4, the n of 2n method does achieve better performance than all the human experts.

As noted above, our human experts created unbalanced review assignments, since they were asked to find the best reviewers for each paper without having to ensure that the reviewers received the same review load or that each paper was reviewed by the same number of reviewers. It is thus fair to compare the performance of the human experts with the performance of our unbalanced automated assignment. As noted above, it is possible to add constraints to the automated methods fairly easily and to use an efficient optimization algorithm that satisfies the constraints. The assignment generated in this way resulted in only a small decrease in mean relevance. In contrast, a human would probably apply the balancing constraints in a sub-optimal manner, since it is virtually impossible for a human to consider all the relevant combinations. We would thus expect the relative performance of the automated methods to improve compared to human experts when balancing constraints are also considered.

Another estimate of the performance of human experts assigning reviews can be derived from the ACM CHI’92 conference on computer–human interaction. Admittedly, this conference is different from the Hypertext’91 conference in many ways, so a direct comparison would not be reasonable. The CHI conference is somewhat more established and covers a larger field, in terms of number of submissions (312), number of reviewers (137), and scope of topics addressed by the conference.

For the ACM CHI’92 conference, submissions were manually assigned to reviewers by the papers chair with the assistance of a small number of area specialists for certain sub-fields. After their meeting, 97 members of the papers committee for whom electronic mail addresses were available were asked to provide estimates of the appropriateness of having them review the papers they had been assigned to review, using the same 0–9 scale used for the Hypertext’91 reviewers. Replies were received from 45 reviewers, for a response rate of 46%. On average, the reviewers had been assigned 6.8 papers to review, and their mean relevance rating was 6.95.

In our Hypertext’91 case study, the reviewers’ mean relevance rating was 6.15 for the top seven papers assigned to them by our automated review assignment based on All Datasets and using 100 dimensions in the scaling. This number is certainly substantially lower than the 6.95 rating from the manual CHI’92 paper assignment, but it is at least in the same general range of the ten-point rating scale. Our n of 2n method based on All Datasets and 100 dimensions achieves a 6.72 mean relevance for n=6 and 6.74 for n=8. This is still lower than the 6.95 CHI’92 rating, but not by much.

Conclusions

It is possible to automate the assignment of submitted manuscripts to relevant reviewers using information retrieval methods. When reviewers are sent only the manuscripts they are asked to review, even the best automated methods perform somewhat worse than manual assignments by human experts, but they do generate fairly good results. A new method called n of 2n results in substantially higher mean ratings and is better than human experts. This method involves sending reviewers twice as many manuscripts as they are actually asked to review and allowing them to choose half of their review assignment themselves from the papers they are sent (the other half is pre-assigned to guarantee coverage of all papers). This method may also have some motivational advantage over traditional assignment methods where the reviewers have no choice.
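Based on the description above, the selection step of the n of 2n method can be sketched per reviewer as follows. How the pre-assigned half is chosen, and the use of the reviewer’s own relevance ratings to simulate their picks, are assumptions for illustration.

import numpy as np

def n_of_2n_selection(sim_row, own_ratings, n, preassigned):
    """One reviewer's n of 2n selection (sketch).

    sim_row      -- LSI similarity of this reviewer to every paper
    own_ratings  -- the reviewer's own relevance ratings, used here to
                    simulate which papers they would pick for themselves
    n            -- number of papers the reviewer actually reviews
    preassigned  -- the n/2 papers fixed in advance so that every
                    submission is covered (how these are chosen is not
                    modeled here)
    """
    sent = np.argsort(-sim_row)[:2 * n]        # the 2n best-matching papers
    fixed = set(preassigned)
    free = [p for p in sent if p not in fixed]
    # The reviewer keeps the pre-assigned half and fills the rest with
    # the papers they themselves rate highest among those sent.
    picks = sorted(free, key=lambda p: -own_ratings[p])[:n - len(fixed)]
    return list(preassigned) + picks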
Although we applied the n of 2n method to the assignment of submitted conference papers to reviewers, we believe that it is also applicable to other cases where one has a large set of information objects that need to be matched with a large set of humans such that each human gets assigned a small number of objects. A further condition for the applicability of the method is the ability of the humans to quickly estimate the relevance of the information objects they must select from.

For the Hypertext’91 case study, the performance of the automated review assignments was fairly independent of the dataset picked to generate the initial domain representation. Of course, we only tested datasets that were relevant for the hypertext domain, and this field is still fairly