
Automating the Assignment of Submitted Manuscripts to Reviewers

Susan T. Dumais and Jakob Nielsen
Bellcore
445 South Street
Morristown, NJ 07960
USA
Email: std@bellcore.com and nielsen@bellcore.com

Reprinted from: Proceedings of the ACM SIGIR'92 15th International Conference on Research and Development in Information Retrieval (Copenhagen, Denmark, 21–24 June 1992), 233–244. © 1992 by the ACM, reprinted by permission. 0-89791-524-0/92/0006/0233...$1.50

Abstract

The 117 manuscripts submitted for the Hypertext'91 conference were assigned to members of the review committee, using a variety of automated methods based on information retrieval principles and Latent Semantic Indexing. Fifteen reviewers provided exhaustive ratings for the submitted abstracts, indicating how well each abstract matched their interests. The automated methods do a fairly good job of assigning relevant papers for review, but they are still somewhat poorer than assignments made manually by human experts and substantially poorer than an assignment perfectly matching the reviewers' own ranking of the papers. A new automated assignment method called "n of 2n" achieves better performance than human experts by sending reviewers more papers than they actually have to review and then allowing them to choose part of their review load themselves.

Keywords: Conferences, Program Committees, Reviewers, Referees, Manuscripts, Papers, Assignment, Matching, Interests, Latent Semantic Indexing, LSI, Hypertext, Information Retrieval.

Introduction

The assignment of submitted manuscripts to reviewers is a common task in the scientific community and is an important part of the duties of journal editors, conference program chairs, and research councils. Finding reviewers for journal submissions and some types of grant proposals can normally be done for a small number of submissions at a time and at a more or less leisurely pace. For conference submissions and other forms of grant proposals (e.g., NSF, Esprit), however, the reviews and review assignments must be completed under severe time pressure, with a very large number of submissions arriving near the announced deadline, making it difficult to plan the review assignments much in advance.

These dual problems of large volume and limited time make the assignment of submitted manuscripts to reviewers a complicated job that has traditionally been handled by a single person (or at most a few people) under quite stressful conditions. Also, manual review assignment is only possible if the person doing the assignments (typically the program chair for the conference) knows all the members of the review committee and their respective areas of expertise. As some conferences grow in scope with respect to number of submissions and reviewers as well as the number of sub-domains of their fields, it would be desirable to develop automated means of assigning the submitted manuscripts to appropriate members of the review committee.*

* A related issue (just seen from the reverse perspective) is the problem of matching funding agencies to research projects [Cohen and Kjeldsen 1987]. In general, the principles described in this paper should be relevant whenever one has two large sets of objects that need to be matched such that each object in one set gets assigned a small number of objects from the other. Other examples include the assignment of press releases to newspaper reporters and summer student job applications to researchers with job openings.

We tried using one such automated information retrieval method to help the program chair for the ACM Hypertext'91 conference (San Antonio, TX, 15–18 December 1991) in assigning the 117 submitted manuscripts to 25 members of the program committee.



Before the submission deadline, the reviewers were asked to send us abstracts of their papers to be used as descriptions of their sub-areas of interest and expertise within the hypertext field. After the deadline, the abstracts of the submitted papers were scanned into a computer and subjected to optical character recognition to generate machine-readable files.*

* The OCR files contained errors even after they had been cleaned up by humans in several independent passes. For future use of automated review assignment, we recommend having the authors provide machine-readable files directly by email or diskette if at all possible.

The actual application of assigning manuscripts to reviewers involves two further considerations in addition to the matching of manuscripts and reviewers. First, one needs to guard against conflicts of interest by not assigning any reviewers their own papers or those of close colleagues. Second, there is a need to balance the review assignments to ensure that no single reviewer is overworked just because that person happens to be an appropriate choice for many papers, and that each paper gets assigned a certain minimum number of reviewers. All these constraints can be expressed as linear inequalities that can be handled by a linear programming package, so the entire review assignment can in principle be done automatically. For the Hypertext'91 conference, however, these two final steps were done manually.

Automated matching requires four elements, each of which is discussed further below:

1. A representation of the domain is required by some methods, including ours. We used a variety of written materials about hypertext that were available in machine-readable form.

2. A representation of the submitted papers. Here, we used their titles and abstracts.

3. A representation of the reviewers' expertise. We used abstracts supplied by the reviewers. Mostly, these were abstracts of papers written by the reviewers for previous conferences or for journals, but some reviewers also sent other abstracts describing their interests.

4. A concrete algorithm to compute the similarity between submissions and reviewers on the basis of the above three representations. We used Latent Semantic Indexing (LSI) [Deerwester et al. 1990], but other information retrieval methods such as vector or probabilistic retrieval could also have been used.

The LSI Method

Latent Semantic Indexing (LSI) is an extension of the vector retrieval method (e.g., [Salton and McGill 1983]) in which the associations between terms are explicitly taken into account. This is done by modeling the association of terms and documents. LSI assumes that there is some underlying or "latent" structure in the pattern of word usage across documents, and uses statistical techniques to estimate this latent structure. A description of terms, documents, and queries based on the underlying latent semantic structure (rather than surface level word choice) is used for representing and retrieving information. See [Deerwester et al. 1990; Furnas et al. 1988] for mathematical details and examples.

The Latent Semantic Indexing (LSI) analysis described by Deerwester et al. [1990] uses singular-value decomposition (SVD), a technique closely related to eigenvector decomposition and factor analysis [Cullum and Willoughby 1985]. The term-by-document matrix is decomposed into a set of k, typically 100 to 300, orthogonal factors from which the original matrix can be approximated by linear combination. Instead of representing documents and queries directly as sets of independent words, LSI represents them as continuous values on each of the k orthogonal indexing dimensions. Since the number of factors or dimensions is much smaller than the number of unique terms, words will not be independent. For example, if two terms are used in similar contexts (documents), they will have similar vectors in the LSI representation. Similarly, a word can be near a document that does not contain it if that is consistent with the overall pattern of word usage in the collection. The LSI technique captures deeper structure than simple term-term correlations or clustering and is completely automatic.

One can interpret the analysis performed by SVD geometrically. The result of the SVD is a k-dimensional vector space in which each term and document is located. These k-dimensional term and document vectors are simply the left and right singular vectors of the SVD analysis. In this space the cosine or dot product between vectors corresponds to their estimated similarity. Since both term and object vectors are represented in the LSI space, similarities between any combination of terms and objects can be easily obtained. Retrieval proceeds by using the terms in a query to identify a point in the space, and all documents are then ranked by their similarity to this query vector.

The LSI method has been applied to several standard information retrieval collections with favorable results. The LSI method has equaled or outperformed standard vector methods and other variants in every case, and was as much as 30% better in some cases. As with standard vector methods, differential term weighting and relevance feedback can improve LSI performance substantially [Dumais 1991]. For the experiments reported in this paper, we used an entropy term weighting which is known to produce good retrieval performance [Salton and McGill 1983].
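To make the preceding description concrete, the sketch below shows a minimal LSI pipeline of the kind described above, written in Python with NumPy. It is our illustration rather than the authors' implementation: the tokenization, the particular entropy-weight variant, and all function and variable names (build_matrix, fold_in, and so on) are assumptions, and a production system would add stemming, a stop list, and sparse matrices.

    import numpy as np
    from collections import Counter

    def build_matrix(docs):
        """Term-by-document matrix of raw term frequencies (toy vocabulary handling)."""
        vocab = sorted({w for d in docs for w in d.lower().split()})
        index = {w: i for i, w in enumerate(vocab)}
        X = np.zeros((len(vocab), len(docs)))
        for j, d in enumerate(docs):
            for w, tf in Counter(d.lower().split()).items():
                X[index[w], j] = tf
        return X, index

    def entropy_weight(X):
        """Global entropy weight per term; one common variant (an assumption here)."""
        p = X / np.maximum(X.sum(axis=1, keepdims=True), 1e-12)
        n = X.shape[1]
        logp = np.log(np.where(p > 0, p, 1.0))   # log(1) = 0 keeps empty cells harmless
        h = -(p * logp).sum(axis=1) / np.log(n)
        return 1.0 - h                           # near 1 for specific terms, near 0 for evenly spread terms

    def lsi(X, k):
        """Truncated SVD: X is approximated by U_k diag(s_k) V_k^T."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        return U[:, :k], s[:k], Vt[:k, :].T      # document vectors are the rows of V_k

    def fold_in(query_tf, U_k, s_k):
        """Place a query (e.g., a reviewer abstract) as a point in the k-dimensional space."""
        return (query_tf @ U_k) / s_k

    def rank_submissions(reviewer_tf, U_k, s_k, D_k):
        """Rank all submissions by cosine similarity to one reviewer abstract."""
        q = fold_in(reviewer_tf, U_k, s_k)
        sims = D_k @ q / (np.linalg.norm(D_k, axis=1) * np.linalg.norm(q) + 1e-12)
        return np.argsort(-sims), sims           # submission indices, most similar first

    # Usage sketch (names are hypothetical):
    #   X, index = build_matrix(submitted_abstracts)
    #   Xw = X * entropy_weight(X)[:, None]
    #   U_k, s_k, D_k = lsi(Xw, k=100)
    #   order, sims = rank_submissions(reviewer_tf_vector, U_k, s_k, D_k)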



Assessing the Review Assignments

In order to assess the goodness of fit between the automated assignments and the reviewers' true interests, we asked the members of the Hypertext'91 papers committee to estimate the degree to which they would be good reviewers for each of the 117 submitted papers. All 25 members of the papers committee were given a questionnaire reprinting the abstracts for each of the 117 papers, and 15 members returned the questionnaire (60% response rate).

Each reviewer rated the relevance of each paper based on its abstract, using a 0–9 scale with 9 indicating the best match between the paper and the reviewer's competence and interests. The anchor points for the scale were:

8–9 = right up my alley
6–7 = good match
4–5 = somewhat relevant
2–3 = I'm following it, sort of
0–1 = how did I get this one?

We should note that relevance judgments are context dependent. Studies of fisheye interfaces [Furnas 1986] found that academicians make much bigger distinctions between disciplines close to their own: e.g., an experimental psychologist will say that Psychiatry and Psychology are much more different than Management and Marketing, whereas someone in the business school would say the reverse [Furnas 1991]. It is thus likely that the hypertext specialists on the Hypertext'91 review committee made much finer distinctions between various topics for papers on hypertext than an "objective" outsider might have made. For some other conferences, any of the papers submitted for Hypertext'91 might just have been classified under the general topic "hypertext," and any of the Hypertext'91 reviewers would have been considered among the best qualified to review them.

The reported results are based only on the 15 reviewers for whom a complete list of relevance judgments was available. The mean relevance rating across all papers and reviewers is 4.7.

These exhaustive relevance judgments serve as a benchmark against which performance is evaluated. The 117 submitted abstracts form the database of "documents." The self-descriptive abstracts given to us by the 15 reviewers are the "queries" used by the automatic information retrieval methods. We use Latent Semantic Indexing (LSI) to compare each reviewer's abstract with all the submissions, and rank the submissions from most to least similar to each reviewer. These results are then compared with the reviewers' relevance judgments. Although we use LSI to compare reviewers and submitted abstracts, this is a straightforward matching problem and any other information retrieval method could have been used.

It is interesting to note that the "queries" used by the automatic retrieval method are somewhat different than those used by the reviewers in making their relevance judgments. While it is always the case that a judge's internal representation of the query is not what is literally available to the automatic methods, this is especially true in the present application, making it a difficult task for the automated methods. Before the submission deadline, the reviewers sent us abstracts which described their interests and areas of expertise. After the program committee meeting, the reviewers were given a questionnaire asking them to rate the relevance of each of the submissions to their interests and expertise. In many cases, reviewers rated quite highly some abstracts about topics not covered in their self-descriptions. Often these represented cases where the reviewer simply failed to give us an abstract about an area of interest. We could have asked our reviewers to match their self-descriptive abstracts against the submissions (as the automatic methods do), but this is a difficult and unrealistic task. It also points to the important problem of getting reviewers (and, more generally, users of information retrieval systems) to specify appropriate queries.

In addition to the practical problem of solving the reviewer assignment problem, this case study will be used to explore a number of issues of general interest to the information retrieval community. We compare several different performance measures, methods for representing the domain, and methods for representing the reviewers' interests.

For use in calculating the precision of the review assignments and the number of relevant articles in the top-n assigned, papers rated six or higher ("good match" and "right up my alley") were considered "relevant" and papers rated five or lower were considered "not relevant" for a particular reviewer. The mean number of relevant papers across all reviewers is 47, or 40% of the total. This is a fairly high proportion of relevant material compared to many other information retrieval applications and may reflect the specialized nature of the Hypertext'91 conference.*

* We have also examined different cutoff values for "relevance." When only papers receiving ratings of 8 or 9 are considered as relevant, there are obviously many fewer relevant papers (an average of 15 per reviewer, or 12%), but the same general pattern of results is observed as that reported below. The relative advantages of the automatic methods compared with random assignment are much larger when this more restrictive definition of "relevance" is used.

Four main measures of performance were used as described below and shown in Table 2. Some measures such as precision are commonly used in information retrieval research; other measures like mean rated relevance are appropriate for this particular application. Mostly we report the results of assigning ten papers per reviewer, as that seems a common review load. The theoretical maximum for each measure is noted in parentheses and is achieved by using the reviewers' own ratings to rank the submitted articles.



• Number of Relevant Articles in Top 10. This is the number of articles rated six or higher by the reviewers in the top ten articles assigned to them by the automated method. (For the ratings given by our reviewers, it would in theory be possible to have an average of 10.0 relevant papers in the top ten.)

• Precision. This is the mean precision (proportion of relevant papers) calculated at the 25%, 50%, and 75% levels of recall. (Ideally, this number should be 1.00, indicating that no irrelevant papers have been ranked higher than any relevant paper.)

• Mean Relevance of Top 10. This measure indicates the mean rated relevance for the first ten papers that were assigned to each reviewer by our automated method. (For the relevance ratings actually given by the 15 reviewers, the average top-ten rating is 8.20. That is, if submitted abstracts were ranked by the reviewers' own ratings, an average rating of 8.20 would have been obtained for the first ten papers assigned to each reviewer.)

• Mean Relevance of Top 5 & 5 More Picked from Top 20. This measure is intended to simulate an alternative review assignment method where the reviewers are sent a larger number of papers than they actually must review. To ensure that each paper is reviewed by a minimum number of reviewers, the automated assignment method would still assign some specific papers for review, but the reviewers would be free to fill the rest of their review responsibilities by choosing an appropriate number of additional papers from the papers they have been sent. Here, we have simulated sending each reviewer twenty papers, with five review assignments being pre-specified and each reviewer selecting five additional papers for review from the remaining fifteen. (The theoretical maximum value of this measure is the same as for the mean top-ten rating, since the two methods would assign the same papers if a perfect review assignment method were used.)

It is important to note that we would observe the theoretical maximum for each measure of performance only if reviewers were perfectly reliable and consistent in their relevance ratings. We know that this is generally not the case, so the maximum values represent an upper bound on performance.

Representing the Domain

We were fortunate to have several textual databases about the domain of the conference (hypertext) available to us in machine-readable form. This enabled us to examine the extent to which a richer representation of relevant domain knowledge could be used to improve the assignment of submitted papers to reviewers.

Different information retrieval methods depend on domain knowledge to different degrees. At one extreme, standard vector matching methods, for example, generally ignore available domain knowledge and simply match submitted abstracts against reviewers. However, vector methods could also use existing domain knowledge to set more appropriate term weights—e.g., a large corpus of relevant material (as compared with a relatively small number of submitted papers) could be used to more reliably establish appropriate term weights. At the other extreme, many AI-based methods depend very heavily on more conceptually-based representations of domain knowledge, and more comprehensive coverage of the domain would almost certainly improve retrieval in such systems.

The LSI method falls somewhere between these two extremes. The SVD analysis is based on the term by document matrix. At a minimum, different domain representations will result in different term weightings. In addition, the full term by document matrix will influence the orthogonal factors obtained from the SVD analysis. The important associative relations that the method extracts will depend on the pattern of term usage across documents. The nature and extent of this influence on the derived factors and retrieval performance is an empirical issue. In the experiments described in this paper, we use several different collections of materials about hypertext to automatically generate different term by document matrices. These matrices are automatically decomposed using SVD and the resulting k-dimensional representations used for matching.

Available Datasets

Table 1 summarizes some of the major characteristics of the databases used in our experiments. As a baseline, we used the 117 abstracts submitted to the conference. We also analyzed these submitted abstracts combined with the 83 abstracts that our reviewers used in their self-descriptions.

Of more interest are analyses of the "pre-existing" databases about hypertext. These collections represent a current and comprehensive coverage of the hypertext field and might provide a better modeling of term dependence than the conference abstracts alone. The LSI analyses of these databases do not take very long (about 1–2 hours on a workstation), but could be performed well before they are needed for the assignment task.

We obtained the full text of three books on hypertext—Nielsen's Hypertext and Hypermedia, Rada's HYPERTEXT: From Text to Expertext, and Shneiderman & Kearsley's Hypertext Hands-On! In addition, we analyzed the union of these three books as a separate dataset. Each paragraph in the Nielsen book was treated as a separate "document" or unit of analysis. For the Rada and Shneiderman & Kearsley books, we used the hypertext nodes (as defined by the authors for the online versions of their books) as the unit of analysis. We did not make any explicit use of the hypertext links.

We also obtained the ACM compendium on hypertext [Akscyn 1992]. This consists of a collection of articles about hypertext.



In one analysis we used the full text of the 128 available articles, treating each article as a unit of analysis. These articles contain 4,511 hypertext nodes in the form of frames. Another analysis treated the 4,511 hypertext frames as "documents." Note that the size (Kbytes) of the ACM dataset is slightly larger when it is split up by hypertext nodes. This is because we repeated the title of the parent article in each of its nodes. This repetition ensures that nodes from the same article will have some term overlap and thus a somewhat higher similarity. In principle, we could have taken advantage of the hypertext links in a similar manner to increase the surface-level similarity for connected nodes in different articles.

Perlman's HCI bibliography [Perlman 1991] is a more general bibliographic database of articles about Human–Computer Interaction, many of which are directly about or relevant to the field of hypertext. This database contains the titles, authors, and usually the full text of the abstract for papers in relevant journals, conferences, and edited books. For many entries, additional fields are used to indicate journal or conference name, publisher, year, etc. On Nov. 26, 1991 this set contained 1,725 entries.

A final analysis used the union of all pre-existing datasets (the three books, ACM Frames, and Perlman's bibliography).

Dataset Name | Dataset Contents | Size (Kbytes) | Number of Units ("Documents") | Number of Terms in More Than One Unit
Submitted Abstracts | Abstracts of all the 117 papers submitted for the Hypertext'91 conference | 137 | 117 | 1,180
Submitted and Reviewers' Abstracts | Abstracts of all the 117 submitted papers combined with 83 abstracts of papers written by the 25 reviewers | 208 | 200 | 1,661
Nielsen | Full text of the book Hypertext and Hypermedia [Nielsen 1990], split by paragraphs | 655 | 1,249 | 3,337
Rada | Full text of the book HYPERTEXT: From Text to Expertext [Rada 1991], split by nodes | 517 | 775 | 3,274
Shneiderman & Kearsley | Full text of the hypertext version of the book Hypertext Hands-On! [Shneiderman and Kearsley 1989], split by nodes | 299 | 252 | 2,426
Three books | The Nielsen, Rada, and Shneiderman & Kearsley datasets combined | 1,472 | 2,276 | 6,008
ACM Articles | Full text of The ACM Hypertext Compendium [Akscyn 1992], split by entire articles as written by the original authors | 4,687 | 128 | 9,714
ACM Frames | Full text of The ACM Hypertext Compendium, split by hypertext nodes as defined by the editors | 4,841 | 4,511 | 12,489
Perlman Bibliography | Bibliographic database of the human–computer interaction and hypertext literature, mostly including abstracts | 1,147 | 1,725 | 5,267
All Pre-Existing Datasets (union) | The combination of the Nielsen, Rada, Shneiderman & Kearsley, ACM Frames, and Perlman datasets | 7,460 | 8,629 | 15,625

Table 1  Summary of the datasets used to build the domain representation of the hypertext field.

The size and coverage of these databases varies considerably, from the 0.14 Mbyte collection of 117 submitted abstracts to the 7.46 Mbyte collection of 8,629 documents which represents the union of all pre-existing datasets. For our experiments, each database was processed in a completely automatic fashion. A term by document matrix was generated for each collection. Cell entries represent the frequency with which a given term occurs in a given document multiplied by the entropy weight for that term. An SVD analysis was performed on this matrix and the k largest singular values and their corresponding term and document vectors were used for retrieval.
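In matrix terms (our notation, since the paper itself gives no formulas), this processing amounts to weighting the raw counts and truncating the decomposition. One standard form of the entropy weight is

\[ x_{ij} = \mathrm{tf}_{ij}\, g_i, \qquad g_i = 1 + \frac{\sum_{j=1}^{n} p_{ij}\,\log p_{ij}}{\log n}, \qquad p_{ij} = \frac{\mathrm{tf}_{ij}}{\sum_{j'} \mathrm{tf}_{ij'}}, \]

where g_i is the entropy weight of term i over the n documents (the exact variant used by the authors is not specified). The weighted matrix X is then approximated by its rank-k SVD,

\[ X \approx U_k \Sigma_k V_k^{\mathsf{T}}, \]

whose left and right singular vectors supply the k-dimensional term and document vectors used for matching.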



The number of dimensions, k, varied from 50 to 500 in the experiments reported here.

Results

Table 2 summarizes the results of some of our experiments. For each dataset, we present results from the four different measures of performance described earlier, and from two different LSI spaces: one with 100 dimensions (which previous research indicates is a reasonable choice) and one with a smaller or larger value depending on the dataset. As in previous experiments [Dumais 1991], we found a large initial increase in performance as the number of dimensions increased, a peak at some intermediate value, and then a gradual decline to a level that corresponds to a standard vector retrieval method.

Dataset Name | Dimensions Used | Number of Relevant Articles in Top 10 | Precision | Mean Rated Relevance of Top 10 | Mean Relevance of Top 5 & 5 More Picked from Top 20
Submitted Abstracts | 50 | 5.7 | .49 | 5.58 | 6.69
Submitted Abstracts | 100 | 5.7 | .51 | 5.57 | 6.67
Submitted and Reviewers' Abstracts | 50 | 5.8 | .49 | 5.66 | 6.93
Submitted and Reviewers' Abstracts | 100 | 5.5 | .50 | 5.49 | 6.53
Nielsen | 100 | 5.7 | .51 | 5.76 | 6.73
Nielsen | 500 | 6.3 | .53 | 5.94 | 6.82
Rada | 50 | 6.0 | .51 | 5.80 | 6.67
Rada | 100 | 6.1 | .52 | 5.77 | 6.73
Shneiderman & Kearsley | 50 | 4.9 | .46 | 5.30 | 6.13
Shneiderman & Kearsley | 100 | 4.9 | .47 | 5.27 | 6.27
Three books | 100 | 5.9 | .52 | 5.79 | 6.74
Three books | 250 | 6.3 | .52 | 6.00 | 6.77
ACM Articles | 50 | 6.4 | .53 | 5.95 | 6.89
ACM Articles | 100 | 6.5 | .50 | 6.06 | 6.84
ACM Frames | 100 | 5.9 | .51 | 5.76 | 6.95
ACM Frames | 200 | 6.0 | .52 | 5.81 | 6.83
Perlman Bibliography | 100 | 6.5 | .53 | 5.95 | 7.00
Perlman Bibliography | 200 | 6.4 | .52 | 5.86 | 6.85
All Datasets | 100 | 5.9 | .51 | 5.70 | 6.74
All Datasets | 230 | 6.3 | .52 | 5.89 | 6.82
Mean of Above | | 5.9 | .51 | 5.75 | 6.73
Random Assignment | | 4.0 | .41 | 4.70 | 5.84
Result of the Perfect Match | | 10.0 | 1.00 | 8.20 | 8.20

Table 2  Performance of the automated review assignment with various domain representations and dimensions. All results are based on matching the reviewers' abstracts with the submitted abstracts and then normalizing the results for those reviewers who supplied multiple abstracts.

In general, performance is well above chance; the results are consistent for the four different measures of performance; and there is somewhat less difference in performance across the domain representations than we expected. Averaged over all datasets and dimensions (some of which we know to be sub-optimal), the mean number of relevant articles in the top 10 articles returned by our automated method is 5.9. So, more than half of the articles are rated as at least a "good match" given our rating scale. This represents an advantage of 48% over what would have been obtained with random assignment of articles to reviewers (an average of 4.0 relevant in the top 10). The average three-point precision value is .51. This represents a 24% advantage over random assignment (for which precision was estimated using 100 different random orderings of articles).
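For concreteness, the sketch below shows one way these per-reviewer summary statistics can be computed. It is our illustration, not the authors' evaluation code: the inputs (ratings, one reviewer's 0–9 judgments of all submissions, and ranking, the submission order produced by the matching method) and the exact interpolation of precision at fixed recall levels are assumptions.

    import numpy as np

    def review_measures(ratings, ranking, top=10, relevant_cutoff=6):
        """Summary measures for one reviewer, given their 0-9 ratings of all
        submissions and a ranking (best first) produced by a matching method."""
        ranked_ratings = np.asarray(ratings)[np.asarray(ranking)]
        relevant = ranked_ratings >= relevant_cutoff           # binarize: 6 or higher counts as relevant

        in_top = int(relevant[:top].sum())                     # relevant articles in the top 10
        mean_rel_top = float(ranked_ratings[:top].mean())      # mean rated relevance of the top 10

        # Three-point precision: precision once 25%, 50%, and 75% of the relevant
        # papers have been reached (one simple interpolation; an assumption here).
        total_rel = int(relevant.sum())
        hits = np.cumsum(relevant)
        precisions = []
        for level in (0.25, 0.50, 0.75):
            needed = int(np.ceil(level * total_rel))
            cutoff = int(np.argmax(hits >= needed)) + 1        # first rank reaching that recall
            precisions.append(hits[cutoff - 1] / cutoff)
        three_point_precision = float(np.mean(precisions))

        return in_top, three_point_precision, mean_rel_top

    # e.g. in_top, prec, mean_rel = review_measures(ratings, ranking)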



The average rated relevance of the top 10 returned articles is 5.75, where a rating of 6 or above indicates a "good match." This represents a 22% advantage over random assignment. The average relevance is even higher (6.73) when the 10 items assigned to each reviewer are determined by a mixture of 5 preassigned articles and 5 reviewer-selected articles from the next 15.

All performance measures are well above what would have been obtained by random assignment, and, in some cases, are better than the assignments of human experts (see the following section on "Comparing the Automated Method to Human Review Assignment" for details). However, the automatic assignment is also far from optimal.

We should note that relevance judgments are not perfect indicators of the "true" relevance of the papers. First, they were based on abstracts only, and the abstracts were somewhat misleading in a few cases. Second, reviewers are unable to judge their own interests with perfect consistency. One of the reviewers completed a second relevance judgment form about seven months after his initial relevance judgments. The correlation between the ratings given by this same individual on these two occasions was .76, which is certainly high enough to indicate better than random consistency (p<.001), but still low enough to indicate some uncertainty in estimating exact relevance values.

Generally speaking, our reliability results are consistent with those reported by Lesk and Salton [1969], although quite different reliability and performance measures are used. We observed relatively large differences in relevance assessments for the same observer on two different occasions (.76 correlation), but these differences did not result in large differences in mean rated relevance (only 5%) when we used judgments obtained on one occasion to predict performance using those obtained on the other occasion.

There is some variability in performance across the different datasets that we used to represent domain knowledge. For example, the average number of relevant articles in the top 10 ranges from 4.9 (Shneiderman & Kearsley, 50 and 100 dimensions) to 6.5 (ACM Articles, 100 dimensions; Perlman Bibliography, 100 dimensions). Overall, the best performance is obtained with the Nielsen, ACM Articles, and Perlman datasets, and the worst performance occurs for the Shneiderman & Kearsley book. The variability across datasets is smaller than one might have expected, and does not appear to be related to the absolute size of the dataset. Performance may be related to some other characteristic of the datasets such as how representative they are of the hypertext field. Quite good performance is obtained simply by analyzing the small collection of submitted abstracts. This may not be as surprising as it appears since the abstracts span exactly the relevant range of topics. However, it does suggest that additional effort to incorporate domain knowledge (insofar as it is taken into account by LSI) may not always be necessary.

These results represent unconstrained review assignments, where no attempt has been made to ensure that each paper is reviewed by a minimum number of reviewers or that each reviewer is assigned a reasonable number of papers. As mentioned in the introduction, a balanced assignment can be derived from our data by adding a number of linear constraints. By using an efficient implementation of Edmonds' b-matching algorithm [Applegate and Cook 1992; Edmonds 1965], we were able to handle the constraints of a balanced review assignment in 3–5 seconds on a standard workstation. We tested the assignment of 2, 3, and 4 reviewers per paper based on several different datasets and found that the mean relevance rating was reduced by 2.3%, 0.9%, and 0.7%, respectively, when taking the balancing constraints into consideration. Basically, it seems that the assignment of multiple reviewers per paper involves so many possible combinations that the addition of constraints does not reduce the quality of the solution much.

Representing the Reviewers' Interests

Reviewers are automatically represented as points in the k-dimensional LSI space. Recall that each reviewer described his/her interests by giving us one or more abstracts. The simplest use of the multiple abstracts is to concatenate them into a single file, but one can sometimes improve the performance of matching methods by considering smaller focused components of large, heterogeneous texts [Salton and Buckley 1991]. We therefore examined three methods of representing reviewers' interests: combined, split-nonorm, and split-norm. For the "combined" method, each reviewer was represented as a single point. The location of that point was determined by taking the weighted average of all the terms in all the abstracts given to us by that reviewer. For the "split" methods, each reviewer was represented as many points, one corresponding to each of the abstracts they gave us (an average of 3.3 per reviewer). For both split methods, we compute the similarity between every abstract representing a reviewer and every document. The two split methods differ only in how the information from the multiple abstracts representing a given reviewer is combined to obtain a single ranking of the similarity between submitted abstracts and the reviewer. In the "split-nonorm" method, we simply pick the largest cosine for each submitted abstract from the cosines for the separate abstracts representing that reviewer. These cosines are then sorted to obtain the ranked list of reviewer-to-abstract similarities. In the "split-norm" method, we first normalize the cosines for each of the reviewers' abstracts (i.e., transform the cosines so that they have a mean of 0 and a standard deviation of 1), then proceed as described above.

Thus, for each reviewer we have three rankings of submitted abstracts. In all cases, the similarity between each reviewer and each submitted abstract is measured by the cosine between the point(s) representing the reviewer and the submitted abstract in LSI space. The resulting rankings are used in our evaluations.
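The three combination schemes can be illustrated as follows. This is our sketch, not the authors' code: the input sims is assumed to be a matrix of cosines with one row per abstract supplied by a single reviewer and one column per submission, and the weighting used in the combined representation is an assumption.

    import numpy as np

    def combined(term_vectors, weights):
        """'Combined': merge the reviewer's abstracts into one pseudo-abstract
        (weighted term average) before matching."""
        return np.average(term_vectors, axis=0, weights=weights)

    def split_nonorm(sims):
        """'Split-nonorm': for each submission take the largest raw cosine over
        the reviewer's separate abstracts, then rank submissions by that score."""
        scores = sims.max(axis=0)
        return np.argsort(-scores), scores

    def split_norm(sims):
        """'Split-norm': z-score the cosines of each abstract (mean 0, s.d. 1),
        so that short and long abstracts get equal weight, then proceed as above."""
        z = (sims - sims.mean(axis=1, keepdims=True)) / (sims.std(axis=1, keepdims=True) + 1e-12)
        scores = z.max(axis=0)
        return np.argsort(-scores), scores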



For the present case study, there is not much difference in the outcome of the review assignment whether reviewers are represented by a single point (as the combination of their abstracts) or as several separate points (as the individual abstracts). Sometimes one method is slightly better, and sometimes the other method is slightly better. Averaged over the measures of performance shown in Table 2, the split-norm method is 0.6% better than the combined method.

Not normalizing the ratings before combining the values for each reviewer's abstracts generates worse results in practically all the cases we have looked at. Averaged over the measures of performance shown in Table 2, the split-norm method is 3.5% better than the split-nonorm method. The data in Table 2 represent the use of the split-norm method, which we chose due to its overall advantage over the combined and split-nonorm methods.

We believe that the advantage of normalizing arises because it, in effect, gives equal weight to all reviewers' abstracts. In some cases, the lengths of the reviewers' abstracts varied quite a bit. If the cosines are not normalized before combining, shorter abstracts will tend to have higher similarities with target documents—i.e., if two abstracts of different lengths have the same overlap (dot product) with a document, the shorter abstract will have a higher cosine with the document. Thus, in cases where the reviewers' abstracts vary in length, shorter abstracts can dominate the combined ranking for partly artifactual reasons. By normalizing the cosines for each of the reviewers' abstracts before combining, these effects are minimized and the distance in the LSI space between the query and submitted abstracts determines the combined ranking.

Although the performance differences between the split and combined representations are not large in the present experiment, we have reason to believe that they would be larger and important in other contexts. Salton and Buckley [1991] and Kane-Esrig et al. [1989] have suggested that retrieval performance can be improved when heterogeneous texts are split into smaller units. The advantage of such an approach will be greatest when the original text is quite heterogeneous (i.e., when the inter-item similarity of the smaller units of the split representation is low). In our application, the reviewers' multiple abstracts were often quite similar to each other. When the split representations are not very different from each other or from the combined representation, small performance differences are to be expected. In other applications such as information filtering, where a single person may have several quite disparate interests, we believe that the advantages of the split methods would be larger.

For practical applications, it may sometimes be difficult to gather machine-readable abstracts from all the members of a review committee. Therefore, we have also considered a minimalist method of representing reviewers by their family names and nothing else. This simple method is feasible when the reviewers' names occur as terms in the dataset used for the domain representation. This will often be the case for research conferences where reviewers are mostly people who have published extensively and are cited frequently by other authors.

Representing reviewers by their name can be done for 24 of the 25 reviewers (14 of the 15 for which full relevance ratings were collected). Unfortunately, the name of one of the reviewers did not occur as a term in the pre-existing datasets. There are many ways of handling this problem. We chose a simple and general method—we added some text containing this reviewer's name to the database. For our particular application, we added the reviewer's self-descriptive abstract to the database. We then use the location of the abstract to determine where the new term (the reviewer's name in this case) is located. Since terms in the LSI space are at the centroid of the documents which contain them, this is roughly the same location that would have been obtained if the database contained this abstract.

Representation of the Reviewers (matching based on "All Datasets") | Dimensions Used | Number of Relevant Articles in Top 10 | Precision | Mean Rated Relevance of Top 10 | Mean Relevance of Top 5 & 5 More Picked from Top 20
Reviewer's Abstracts | 100 | 5.9 | .51 | 5.70 | 6.74
Reviewer's Abstracts | 230 | 6.3 | .52 | 5.89 | 6.82
Reviewer's Name only | 100 | 5.7 | .48 | 5.73 | 6.74
Reviewer's Name only | 230 | 5.5 | .49 | 5.62 | 6.63

Table 3  Performance of the automated review assignment method when "All Datasets" are used for the domain representation. The first two rows are based on matching the reviewers' abstracts with the submitted abstracts and then normalizing the results for those reviewers who supplied multiple abstracts. The last two rows are based on matching the reviewers' family names with the submitted abstracts.

Table 3 shows the results of representing reviewers with their names only, when All Datasets are used to form the domain representation. The results are somewhat worse than most of the analyses using the reviewers' full abstracts. The use of full abstracts is 4.3% better on average than using names only (averaged over the performance measures shown in Table 3).



Review assignments using the reviewers' names as the only representation of their interests are thus a feasible fall-back method if abstracts are unavailable for some reviewers, but the approach is not recommended in general.

One obvious problem with representing the reviewers by their family name only is that several different individuals may have the same name even though they have widely differing research interests and review expertise. The Hypertext'91 reviewers mostly had fairly rare family names that to our knowledge are not shared by other active researchers in the hypertext field. It is likely, however, that some name overlap would occur if the same method were to be used for broader fields.

One can also represent reviewers by the keywords given for their papers rather than the full text of their abstracts. Doing so gives slightly better performance than using the reviewers' names only, but using the full abstracts is still 4.1% better than using the keywords.

Unless otherwise noted, all results reported in this paper represent the reviewers by several independent abstracts for which the cosine similarity measures have been normalized before combining (the split-norm method).

Varying the Number of Papers Assigned to Each Reviewer

Figure 1 compares the mean relevance rating for various assignment methods as more and more papers are assigned to each reviewer. The figure has curves for three of the automatic LSI methods listed in Table 2 (All Datasets, the 117 submitted abstracts, and the Shneiderman & Kearsley book). Figure 1 also contains curves for the n of 2n method discussed further below, and for perfect matches and random assignment.

[Figure 1: Mean Relevance Rating (y-axis, 4 to 9) plotted against Number of Papers per Reviewer (x-axis, 0 to 40). Curves: Perfect Match of Reviewers' Interests; n of 2n (All Datasets); LSI (All Datasets); LSI (Submitted Abstracts); LSI (Shneiderman & Kearsley); Random Review Assignments. Points: Human Experts.]

Figure 1  Curves showing how the mean relevance rating of the assigned papers drops as more papers are assigned to each reviewer when using different datasets for the domain representation. All curves (except the top and bottom curves) have been calculated with the number of LSI dimensions being equal to 100. The top curve shows the theoretically optimum relevance rating that would be achieved by perfectly matching the reviewers' own ratings. The n of 2n curve (stippled gray) shows the result of sending the reviewers twice as many papers as they are asked to review and having them pick half of their review assignment themselves from among these papers. The black dots indicate the performance of three human experts who simulated the program chair's job.
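The simulated n of 2n results in the figure can be reproduced along the lines of the following sketch. It is our illustration, not the authors' code; in particular, as in the "Top 5 & 5 More" measure of Table 2, it assumes that the reviewer's own choice is simulated by taking the papers they rated highest among the remainder, and that the review load n is even.

    import numpy as np

    def n_of_2n_mean_relevance(lsi_ranking, ratings, n):
        """Mean rated relevance of an n-paper load when the reviewer is sent the
        top 2n papers, n/2 are pre-assigned, and the reviewer 'chooses' the other
        n/2 (simulated here by their own highest-rated papers among the rest)."""
        ratings = np.asarray(ratings)
        sent = list(lsi_ranking[:2 * n])          # papers mailed to the reviewer
        preassigned = sent[: n // 2]              # guarantees coverage of every paper
        remainder = sent[n // 2:]                 # the other 3n/2 papers
        chosen = sorted(remainder, key=lambda p: -ratings[p])[: n - n // 2]
        load = preassigned + chosen
        return float(ratings[load].mean())

    # e.g. n_of_2n_mean_relevance(order, ratings, n=10) for a ten-paper review load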



The perfect match curve indicates the results of following the reviewers' own ranking and is by definition monotonically non-increasing. In contrast, the curves representing the various automatic review assignments sometimes contain bumps, indicating improved performance with the assignment of an additional paper. These bumps reflect the fact that the automatic methods are not perfect and sometimes will rank an extremely relevant paper fairly low. When the review load for a reviewer is expanded to include such a highly rated paper, the mean relevance of the review assignment goes up slightly.

The overall mean relevance rating for all the submitted papers for all the reviewers is 4.7, as indicated by the lower flat line in Figure 1. So all the automatic review assignments do select considerably better than random paper assignment regardless of how many papers are assigned to each reviewer (from 1 to 40).

Figure 1 also shows the simulated results of an alternative review assignment method where the reviewers help choose their own review assignments. The method is called "n of 2n" and is based on sending each reviewer twice as many papers as that reviewer is actually asked to review. This is a generalization of the "Top 5 and 5 More" method reported in Table 2. The 2n papers are simply chosen as the top 2n papers for each reviewer, calculated by the automatic LSI method (here, we used All Datasets at 100 dimensions). The top n/2 papers are assigned to the reviewer. This ensures that each paper receives a minimum number of reviews. The other n/2 papers to be reviewed are chosen by the reviewer from among the remaining 3n/2 papers sent to that reviewer. We believe that there is also a motivating effect in having reviewers choose part of their own review load, but we do not have any empirical evidence to support this claim. The motivation of the reviewers is important for conference program committees, which normally consist of unpaid volunteers.

As can be seen from Table 2, the mean rated relevance for a review load of ten papers goes up by about one rating unit (from 5.75 to 6.73, or 17%) when the reviewers are sent twenty papers instead of ten and are allowed to choose five of their papers themselves. Figure 1 shows that the advantage of the n of 2n method is fairly constant except when very few papers are assigned to each reviewer.

The n of 2n curve must fall between the LSI All Datasets curve and the perfect match curve because it represents a combination of these two assignment methods. As the number of articles assigned to each reviewer increases, the n of 2n curve approaches the perfect match curve.

Comparing the Automated Method to Human Review Assignment

In addition to comparing the various automated review assignment methods to each other, we also compared them to the performance of human program chairs assigning reviews manually. This section presents two such comparisons: a simulation of manual review assignments for the same submissions and reviewers used for the automated methods, as well as the actual manual assignments from a different conference.*

* We could not use the actual assignments from Hypertext'91 because the program chair used a mixture of automated and manual methods to make the original assignments.

In order to simulate manual assignment of the Hypertext'91 submissions, three human hypertext experts† independently read through all the abstracts and judged which of the members of the review committee would be appropriate as reviewers for each paper. No attempt was made to balance the number of papers assigned to each reviewer or the number of reviewers assigned to each paper. Therefore, the number of papers assigned to the individual reviewers varied from 11 to 42 (mean 27.7) for the first expert, 17 to 40 (mean 27.7) for the second expert, and 20 to 41 (mean 29.7) for the third expert.

† The first human expert is the editor of a respected information systems journal which has published many hypertext papers. This person is not a specialized hypertext expert to the same degree as the two other experts, however. The second and third human experts were people who have served as program chairs for a major hypertext and a major user interface conference, respectively, and are recognized as knowledgeable in the hypertext field. It is reasonable to expect all three experts to be able to perform the task well, and the second and third experts are probably very close to achieving the optimum performance one could get from a human.

For the Hypertext'91 conference, the members of the program committee actually reviewed an average of 25.8 papers each, so the 28–30 papers assigned by our human experts come fairly close to reflecting the actual review load for this conference. Most conferences would probably place a smaller review load on their reviewers, however.

The manual assignment method assigns more papers to reviewers that are seen by the human expert as having broader coverage. Since the number of articles assigned to reviewers varies, one would expect somewhat different performance than that observed when just picking the top 28 or 30 papers for each reviewer. In practice, however, the differences are very small (less than 1%). Roughly speaking, then, we can compare the automatic methods with the human experts by considering the performance of the automatic methods at 28 and 30 papers per reviewer, respectively, as shown in Figure 1. The mean performance of the three human experts is represented as points in Figure 1.

Table 4 compares the mean rated relevance of the papers assigned by the human experts with the relevance of the same number of papers assigned by the automated methods (the LSI method and the n of 2n method, both using All Datasets at 100 dimensions).



 | Expert 1 | Expert 2 | Expert 3
Mean Number of Papers Assigned to Each Reviewer | 27.7 | 27.7 | 29.7
Mean Rating of Papers Assigned by This Expert | 5.39 | 5.92 | 5.94
Mean Rating of Papers Assigned by the LSI Method (All Datasets, n=100) | 5.33 | 5.33 | 5.31
Mean Rating of Papers Assigned by the n of 2n Method | 6.35 | 6.35 | 6.34

Table 4  Comparison of the performance of human experts with two of the automated methods.

These results indicate that the simple LSI method is not as good as the best human experts but that it performs in the same general range and achieves the same performance as our Expert 1. As indicated by Table 4, the n of 2n method does achieve better performance than all the human experts.

As noted above, our human experts created unbalanced review assignments since they were asked to find the best reviewers for each paper without having to ensure that the reviewers received the same review load or that each paper was reviewed by the same number of reviewers. It is thus fair to compare the performance of the human experts with the performance of our unbalanced automated assignment. As noted above, it is possible to add constraints to the automated methods fairly easily and to use an efficient optimization algorithm that satisfies the constraints. The assignment generated in this way resulted in only a small decrease in mean relevance. In contrast, a human would probably apply the balancing constraints in a sub-optimal manner since it is virtually impossible for a human to consider all the relevant combinations. We would thus expect the relative performance of the automated methods to improve compared to human experts when balancing constraints are also considered.

Another estimate of the performance of human experts assigning reviews can be derived from the ACM CHI'92 conference on computer–human interaction. Admittedly, this conference is different from the Hypertext'91 conference in many ways, so a direct comparison would not be reasonable. The CHI conference is somewhat more established and covers a larger field, in terms of number of submissions (312), number of reviewers (137), and scope of topics addressed by the conference.

For the ACM CHI'92 conference, submissions were manually assigned to reviewers by the papers chair with the assistance of a small number of area specialists for certain sub-fields. After their meeting, 97 members of the papers committee for whom electronic mail addresses were available were asked to provide estimates of the appropriateness of having them review the papers they had been assigned to review, using the same 0–9 scale used for the Hypertext'91 reviewers. Replies were received from 45 reviewers, for a response rate of 46%. On average, the reviewers had been assigned 6.8 papers to review, and their mean relevance rating was 6.95.

In our Hypertext'91 case study, the reviewers' mean relevance rating was 6.15 for the top seven papers assigned to them by our automated review assignment based on All Datasets and using 100 dimensions in the scaling. This number is certainly substantially lower than the 6.95 rating from the manual CHI'92 paper assignment, but it is at least in the same general range of the ten-point rating scale. Our n of 2n method based on All Datasets and 100 dimensions achieves a 6.72 mean relevance for n=6 and 6.74 for n=8. This is still lower than the 6.95 CHI'92 rating, but not by much.

Conclusions

It is possible to automate the assignment of submitted manuscripts to relevant reviewers using information retrieval methods. When reviewers are sent only the manuscripts they are asked to review, even the best automated methods perform somewhat worse than manual assignments by human experts, but they do generate fairly good results. A new method called n of 2n results in substantially higher mean ratings and is better than human experts. This method involves sending reviewers twice as many manuscripts as they are actually asked to review and allowing them to choose half of their review assignment themselves from the papers they are sent (the other half is pre-assigned to guarantee coverage of all papers). This method may also have some motivational advantage over traditional assignment methods where the reviewers have no choice.

Although we applied the n of 2n method to the assignment of submitted conference papers to reviewers, we believe that it is also applicable to other cases where one has a large set of information objects that need to be matched with a large set of humans such that each human gets assigned a small number of objects. A further condition for the applicability of the method is the ability of the humans to quickly estimate the relevance of the information objects they must select from.

For the Hypertext'91 case study, the performance of the automated review assignments was fairly independent of the dataset picked to generate the initial domain representation.



Of course, we only tested datasets that were relevant for the hypertext domain, and this field is still fairly small and focused. One of the datasets did exhibit substantially poorer performance than the others, however, so it is worth some effort to try to avoid such datasets or to compensate by combining them with others.

We know in retrospect that the Shneiderman & Kearsley book turned out to be comparatively poor as a basis for forming a domain representation of hypertext in LSI. Combining several books does not improve performance over using a single book that happens to be appropriate for the purpose, but one cannot know whether any given book will in fact be appropriate without first collecting exhaustive relevance feedback from the reviewers. Since it would seem hard—if not impossible—to determine in advance which books are appropriate as domain representations, the best advice is probably to rely on as large collections of books and other text as possible.

For the Hypertext'91 case study, we found that representing reviewers as several points corresponding to each of their self-described interests resulted in slightly better matching performance. We believe that the advantage of split representations will be larger the more different a person's areas of interest are from each other. Applications such as information filtering (where a single person may have several disparate interests), or databases where queries and/or target documents are large and heterogeneous, are most likely to benefit from such methods. Normalization is important when combining information from multiple representations.

Acknowledgments

The authors would like to thank Ben Shneiderman, Greg Kearsley, and Paul Hoffman for providing a machine-readable version of the full text of the book Hypertext Hands-On! and Roy Rada for providing a machine-readable version of the full text of the book HYPERTEXT: From Text to Expertext. We also thank Robert Akscyn and Elise Yoder for providing early access to the files of The ACM Hypertext Compendium and Gary Perlman for making his extensive HCI Bibliography Project collection of abstracts available. Our analyses on the basis of the word usage in these manuscripts are solely our responsibility and should not be taken as necessarily corresponding to the positions of the authors and editors of these publications.

The authors thank the anonymous SIGIR referees, Tom Landauer, Michael Littman, John Patterson, and Scott Stornetta for helpful comments on earlier versions of this paper. We also thank Bill Cook for helping us use his implementation of an efficient matching algorithm for optimization under constraints.

Finally, the authors thank several members of the Hypertext'91 program committee for letting themselves be volunteered into providing us with exhaustive relevance ratings of the submitted papers. We also thank the three human experts for helping us with the time-consuming job of estimating the match between each individual reviewer and each submitted paper.

References

Akscyn, R. (Ed.) (1992). The ACM Hypertext Compendium. ACM Press, New York, NY.

Applegate, D., and Cook, W. (1992). Solving large-scale matching problems. Paper to appear.

Cohen, P.R., and Kjeldsen, R. (1987). Information retrieval by constrained spreading activation in semantic networks. Information Processing & Management 23, 4, 255–268.

Cullum, J.K., and Willoughby, R.A. (1985). Lanczos Algorithms for Large Symmetric Eigenvalue Computations, Vol. 1: Theory (Chapter 5: Real rectangular matrices). Birkhäuser, Boston.

Deerwester, S., Dumais, S.T., Landauer, T.K., Furnas, G.W., and Harshman, R.A. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6, 391–407.

Dumais, S.T. (1991). Improving the retrieval of information from external sources. Behavior Research Methods, Instruments, and Computers 23, 2, 229–236.

Edmonds, J. (1965). Maximum matching and a polyhedron with 0,1-vertices. Journal of Research of the National Bureau of Standards 69B, 125–130.

Furnas, G.W. (1986). Generalized fisheye views. Proc. ACM CHI'86 (Boston, MA, 13–17 April), 16–23.

Furnas, G.W. (1991). Personal communication.

Furnas, G.W., Deerwester, S., Dumais, S.T., Landauer, T.K., Harshman, R.A., Streeter, L.A., and Lochbaum, K.E. (1988). Information retrieval using a singular value decomposition model of latent semantic structure. In Proceedings of SIGIR'88, 465–480.

Kane-Esrig, Y., Casella, G., Dumais, S.T., and Streeter, L.A. (1989). Ranking documents for retrieval by modeling of a relevance density. In Proceedings of the 12th IRIS Conference (Information Systems Research Seminar in Scandinavia), Skagen, Denmark, August 1989.

Lesk, M.E., and Salton, G. (1969). Relevance assessments and retrieval system evaluation. Information Storage and Retrieval 4, 343–359.

Nielsen, J. (1990). Hypertext and Hypermedia. Academic Press, San Diego, CA.

Perlman, G. (1991). The HCI bibliography project. ACM SIGCHI Bulletin 23, 3 (July), 15–20.

Rada, R. (1991). HYPERTEXT: From Text to Expertext. McGraw-Hill, London, U.K.

Salton, G., and Buckley, C. (1991). Automatic text structuring and retrieval—Experiments in automatic encyclopedia searching. Proc. ACM SIGIR'91 Conf. (Chicago, IL, 13–16 October), 21–30.

Salton, G., and McGill, M.J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill.

Shneiderman, B., and Kearsley, G. (1989). Hypertext Hands-On! An Introduction to a New Way of Organizing and Accessing Information. Addison-Wesley, Reading, MA.

