

for Educational Evaluation & Testing

Evaluation of Selected General Cognitive Ability Tests

Submitted to


Employment Selection & Assessment Program


Ministry of Civil Service

By

Sabbarah for Educational Evaluation and Testing

Prepared by

Selection and Assessment Consulting


Jerard F. Kehoe, Ph.D.
James Longabaugh, MA

August 26, 2013


TABLE OF CONTENTS

Executive Summary 3

Section 1: Purpose of Report 5

Section 2: Description of Report 6

Section 3: Objectives of Study 2 8

Section 4: Methodology of Study 2 10

Section 5: Descriptive Results 12


Differential Aptitude Test for Personnel and Career Assessment (DAT PCA) 12
Employee Aptitude Survey (EAS) 27
General Aptitude Test Battery (GATB) 53
Information about OPM’s New Civil Service Tests 77
Professional Employment Test (PET) 82
Power and Performance Measures (PPM) 91
Verify 105
Watson-Glaser Critical Thinking Appraisal 127

Section 6: Integration, Evaluation and Interpretation of Results 148


The Question of Validity 148
Test Construct Considerations 149
Item Content Considerations 155
Approaches to Item / Test Security and Methods of Delivery 159
Item and Test Development 161
Strategies for Item and Bank Management 166
Use of Test Results to Make Hiring Decisions 169
Considerations of Popularity, Professional Standards, and Relevance to MCS 172

Section 7: Recommendations and Suggestions 175


GCAT Plan and Specifications 175
Choosing Test Constructs 175
Specifying Item Content 178
Methods of Item and Test Development 182
Validation Strategies 184

Organizational and Operational Considerations 199


Security Strategy 199
Item Banking 200
Staff Requirements 201
User Support Materials 204
Guiding Policies 205

Section 8: References 209

EXECUTIVE SUMMARY
Background Information

The purpose of this Study 2 report is to review and evaluate selected tests of general cognitive ability
used for personnel selection in order to make recommendations to the Ministry of Civil Service (MCS)
about its development of a civil service test battery. Seven tests of general mental ability (GMA) were
selected for this review based on their expected relevance to MCS’s interests. The objective of Study
2 was to provide MCS with evaluative and instructive information about the development, validation,
and use of GMA tests for large-scale applications relevant to MCS’s planned General Cognitive Ability
Test (GCAT), by reviewing and evaluating seven existing GMA tests that represent models or
comparisons relevant to MCS’s plan for GCAT.

Selection and Assessment Consulting (SAC) reviewed and evaluated the batteries by gathering test-
related documents from publishers and from independent sources and by interviewing publishers’
testing experts for further detailed information.

Findings Relevant to MCS

What Do the Batteries Measure?

1. All batteries comprise between 4 and 10 subtests, each designed to measure a
different cognitive ability (excluding the psychomotor subtests in GATB).
2. Except for Watson-Glaser, which is designed to measure critical thinking skills, nearly 80%
(39 of 49) of all subtests measure one of four core categories of cognitive ability: Verbal
Ability, Quantitative Ability, Reasoning Ability, and Spatial/Mechanical Ability.
3. Excluding psychomotor and speeded subtests, 70% (24 of 34) of all subtests contain
sufficient acquired knowledge content to be considered measures of crystallized ability.
4. Of the 24 subtests measuring crystallized ability, 18 (75%) are based on work-neutral item
content; 6 are based on work-like item content. No subtest is designed specifically to
measure acquired job knowledge.
5. Except for Watson-Glaser and PET, all batteries are designed to be applicable across a wide
range of job families and job levels. This is achieved largely through work-neutral item
content and moderately broad abilities that do not attempt to measure highly job-specific
abilities.

What Types of Validity Evidence Are Reported?

1. All batteries report some form of construct validity evidence, always in the form of correlations
with subtests in other, similar, well-known batteries.
2. All batteries except PPM report predictive validity evidence, usually against job performance
criteria consisting of supervisory ratings of proficiency or measures of training success.
3. No content validity evidence linking test content to job content is reported for any battery.

How Are Batteries Used?

1. All batteries except Verify are available for use in proctored settings with a small number (1-3)
of fixed forms that are modified only infrequently.
2. All batteries are available online in addition to a paper-pencil mode.
3. Verify and Watson-Glaser (UK) are the only batteries available online in an unproctored
setting. For both batteries, unproctored administration is supported by IRT-based production
of randomized forms, so that virtually all test takers receive a nearly unique, equivalent form.
4. For all batteries except Verify, scores are based on number-correct raw scores (corrected for
guessing in two batteries). Most batteries transform raw scores to percentile rank scores
based on occupation or applicant group norms.
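As an illustration of the scoring conventions described above, the following Python sketch (with hypothetical numbers, not drawn from any reviewed battery) applies the conventional correction-for-guessing formula and converts a raw score to a percentile rank against a norm group:

```python
def corrected_score(num_correct, num_wrong, options_per_item):
    """Classical formula score: R - W/(k-1), penalizing wrong answers
    by the chance rate, where k is the number of response options."""
    return num_correct - num_wrong / (options_per_item - 1)

def percentile_rank(score, norm_scores):
    """Percent of the norm group scoring at or below this score."""
    at_or_below = sum(1 for s in norm_scores if s <= score)
    return 100.0 * at_or_below / len(norm_scores)

# Hypothetical example: 20 correct, 8 wrong on a 5-option test
raw = corrected_score(20, 8, 5)                     # 20 - 8/4 = 18.0
norms = [10, 12, 14, 15, 16, 18, 19, 21, 22, 25]    # illustrative norm group
pr = percentile_rank(raw, norms)
print(raw, pr)                                      # 18.0 60.0
```

Batteries that do not correct for guessing simply use `num_correct` as the raw score before the norm-group transformation.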

Recommendations and Suggestions

Recommendations and suggestions are grouped into two categories – those relating to the
development of GCAT and those relating to the management of the civil service testing program.
Key recommendations are summarized here.

GCAT Development

1. It is recommended that MCS develop at least two subtests within each of the Verbal Ability,
Quantitative Ability, and Reasoning Ability categories, as those categories are exemplified by
most of the reviewed batteries (not Watson-Glaser). A modest amount of job information
should be gathered to determine whether additional subtests are needed for Processing
Speed and Accuracy and/or for Spatial/Mechanical Ability.
2. Most, if not all, subtests should use item content associated with crystallized ability and, to the
extent possible, that content should be in a work-like context to promote user acceptance and
confidence.
3. Initial item and subtest development is likely to rely on classical test theory (CTT) for item
retention and subtest construction. Yet, MCS should plan to migrate to an IRT-based item
bank approach after implementation to prepare for the capability of producing many
equivalent forms.
4. Initial item and subtest development should be sized to produce 4-6 equivalent forms of each
subtest, with the exception of two forms of Processing Speed and Accuracy subtests. All
forms should be in use from the beginning to establish a reputation for strong security.
5. Item developers are encouraged to develop items within each subtest at moderately diverse
levels of difficulty/complexity around the level appropriate to the typical education level of the
applicants. Moderate diversity in difficulty/complexity will create eventual opportunities,
using IRT-based forms production, to tailor subtests to the level of the target job
family.
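To make the tailoring idea in recommendation 5 concrete, the sketch below (illustrative Python; the bank format and parameter values are assumptions, not taken from any reviewed battery) shows the three-parameter logistic (3PL) item response function and a naive way an IRT-calibrated bank could be filtered to items whose difficulty centers on a target level:

```python
import math

def p_correct_3pl(theta, a, b, c):
    """3PL item response function: probability that a test taker of
    ability theta answers correctly, given discrimination a,
    difficulty b, and pseudo-guessing parameter c."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

def items_near_level(bank, target_b, n):
    """Pick the n bank items whose difficulty b is closest to the level
    of the target job family. The (item_id, a, b, c) bank format is a
    hypothetical illustration."""
    return sorted(bank, key=lambda item: abs(item[2] - target_b))[:n]

# Illustrative calibrated items
bank = [("v01", 1.2, -1.0, 0.20), ("v02", 0.9, 0.0, 0.20),
        ("v03", 1.1, 0.5, 0.20), ("v04", 1.4, 1.5, 0.20)]
print(p_correct_3pl(theta=0.0, a=1.0, b=0.0, c=0.2))  # ~0.6 when theta == b
print([i[0] for i in items_near_level(bank, target_b=0.3, n=2)])  # ['v03', 'v02']
```

A production assembler would also balance content categories and item information, but the principle is the same: items calibrated across a moderate difficulty range let forms be centered on the target job family's level.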

Program Management

1. Establish a proactive, well-communicated security strategy from the beginning, modeled after
many aspects of SHL’s strategy for supporting Verify. This should include multiple forms,
item refreshing, bank development, data forensics, web patrols, and an “Honesty” agreement
that applicants are required to sign.
2. As soon as possible, MCS should evolve to a forms production strategy that is IRT-based and
supported by large item banks. This approach is intended primarily to protect the integrity of
the testing program and is not intended to encourage unproctored administration.
Nevertheless, this approach would be capable of supporting unproctored administration, if
necessary. SHL’s Verify approach is a model that adheres to current professional
standards.
3. Establish a significant research function (coupled with development) to provide the capability
to refine and improve the testing program over time as empirical data accumulates to support
validation efforts, to support the design of Phase 2, and to support investigations of program
effectiveness.
4. Staff the support organization with the expertise and experience required to manage an
effective personnel selection program. The operational leadership role should require
extensive experience and expertise in personnel selection technology and program
management.
5. Establish roles for managing relationships with key stakeholders including applicants, hiring
organizations, the Saudi Arabian public including schools and training centers, and Testing
Center staff.
6. Develop a full set of guiding policies to govern the manner in which applicants, hiring
organizations, and Test Center staff may utilize and participate in the testing program.
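The randomized-forms idea behind recommendation 2 can be sketched as follows (illustrative Python only; this is not SHL's actual production method, and the slot/bank structure is an assumption). Each content slot holds items pre-screened to equivalent difficulty, and one item is drawn per slot so that test takers see nearly unique but comparable forms:

```python
import random

def randomized_form(bank_by_slot, session_seed):
    """Draw one item per content slot. bank_by_slot is a hypothetical
    {slot_name: [item_ids]} mapping in which every item within a slot
    has been calibrated to equivalent difficulty."""
    rng = random.Random(session_seed)  # one seed per test session
    return {slot: rng.choice(items)
            for slot, items in sorted(bank_by_slot.items())}

bank = {
    "verbal": ["v101", "v102", "v103"],
    "numerical": ["n201", "n202", "n203", "n204"],
    "reasoning": ["r301", "r302", "r303"],
}
print(randomized_form(bank, session_seed=1))
print(randomized_form(bank, session_seed=2))
```

With large banks, exposure of any single item is low, so a leaked form compromises only a small fraction of the bank; this is the security benefit the recommendation targets.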

SECTION 1: PURPOSE OF REPORT
The purpose of this Study 2 report was to review and evaluate selected tests of general cognitive
ability developed to be used for personnel selection applications. Seven test batteries were selected
to help inform the Ministry of Civil Service (MCS) of the Kingdom of Saudi Arabia about the task of
developing a national General Cognitive Ability Test (GCAT). Selection and Assessment Consulting
(SAC) was requested by Sabbarah for Educational Evaluation and Testing (Sabbarah) to conduct
Study 2.

Study 2 is one of three studies planned to support MCS’s effort to develop GCAT for use in recruiting
and employing applicants in a wide range of civil service jobs. The overall purpose of these studies
is to provide MCS with evaluative and instructive information about the development, validation and
use of cognitive ability tests for large scale applications relevant to MCS’s planned GCAT. The
objective of Study 2 was to review and evaluate seven existing cognitive ability test batteries that
represent models or comparisons relevant to MCS’s plan for GCAT.

This review and evaluation provides

a) descriptive information including usage, modes of delivery, user requirements, and available
documentation,
b) technical information regarding scoring, measurement/psychometric models, reliability and
validity evidence, and information about the definitions of the tested abilities and methods for
developing the items comprising the subtests,
c) evaluative information for each test including its standing with respect to professional
standards and principles, and issues associated with translation and adaptation, and
d) suggestions and recommendations taken from the evaluative reviews regarding the
development of GCAT and implementation issues such as security maintenance, item
banking strategies, and test management issues relating to large scale testing programs.

The information, suggestions, and recommendations contained within this Study 2 report are meant to
assist test developers and designers, item writers and reviewers, and psychometricians in developing
plans and specifications for GCAT so that MCS may use the resulting cognitive ability battery as a
selection instrument across a wide range of civil service jobs. The descriptive information reported in
Section 5 about each of the seven reviewed batteries provides the basis for the evaluative and
integrative perspectives in Section 6 and provides part of the basis for the recommendations
presented in Section 7. The recommendations rely on the integration of information about the
selected batteries and SAC’s prior experience with large scale selection programs. Other information
that informed the interpretations and evaluations in Section 6 and the recommendations and
suggestions in Section 7 was derived from relevant published information that is informative to MCS,
as well as information that was gained from interviewing test publishers.

SECTION 2: DESCRIPTION OF REPORT
The descriptive and evaluative information about the selected batteries is provided in Sections 5 and
6, respectively. The recommendations are presented in Section 7. These three sections constitute
the primary contributions of Study 2.

A. Section 5: Results, Overall Description and Uses


a. Provides descriptions of the seven selected cognitive ability batteries, summarizing
the technical and other test-relevant information gathered from the resources
provided by the test publishers and from alternate sources. No significant
evaluative information is provided in this section, except for summaries of
previously published evaluative reviews of the batteries; evaluative information is
contained within Section 6. The information has been organized to be clear and
succinct for the reader. A complete summary of all descriptive information is
provided separately in Section 5 for each of the seven reviewed batteries. Each of
these battery descriptions is organized into nine major subsections:

 Overall Description and Uses
 Administrative Details
 Construct-Content Information
 Item and Test Development
 Criterion Validity Evidence
 Approach to Item / Test Security
 Translations / Adaptations
 User Support Resources
 Evaluative Reviews

B. Section 6: Integration, Interpretation and Evaluation of Results


a. Section 6 integrates and evaluates the information in Section 5 about each battery
into overall observations and conclusions about this set of batteries. This section
follows the presentation of extensive information about each target battery by
integrating that information into comparative assessments of battery qualities,
interpreting the potential implications for MCS’s civil service test system and
evaluating the features of batteries as possible models for MCS civil service tests.
Section 6 is organized into eight subsections.

 The question of validity
 Test construct considerations
 Item content considerations
 Approaches to item / test security and methods of delivery
 Item and test development
 Strategies for item management / banking
 Use of test results to make hiring decisions
 Considerations of popularity, professional standards and relevance to MCS
interests

C. Section 7: Recommendations and Suggestions


a. This section provides recommendations and suggestions for MCS about
specifications and planning for a national GCAT. This section provides suggestions
about important organizational and operational considerations relating to
management, staffing, security, item banking, technical support, administration, and
the design of test guides, manuals, and reports. These recommendations are based
on multiple sources including the reviews of the seven batteries, SAC’s own
professional experience with personnel selection testing programs, and the
professional literature relating to the design and management of personnel selection
systems. This section provides these suggestions and recommendations organized
in the following manner:

i. GCAT Plan and Specifications
1. Choosing Test Constructs
2. Specifying Item Content
3. Methods of Item and Test Development
4. Validation Strategies
ii. Organizational and Operational Considerations
1. Security Strategy
2. Item Banking
3. Operational Staff Requirements
4. Design of User Materials
5. Strategies and Policies

SECTION 3: OBJECTIVES OF STUDY 2
Study 2 had several important objectives relating to reviewing and evaluating GMA tests.

 Provide detailed reviews of seven selected GMA tests used in employment selection. The
information relevant for inclusion in the reviews was to consist of the GMA tests’ purposes,
content dimensions, delivery platforms, types of scores, spread of uses, validity research,
technical quality, availability of documentation, and other applicable and relevant information.

 List cognitive ability tests ranked by their popularity of use and compliance with the principles
and professional standards for testing (e.g., APA, NCME, AERA & SIOP).

 Identify the most important considerations to be taken from these tests with regard to the
potential GCAT specifications and plans.

 Discuss important considerations regarding test specifications and plans with respect to
needed resources, time, staffing, technical support, organization, management and
administration of GMA tests used in an employment context.

 Provide detailed descriptions and comparisons of the batteries in terms of

o Purpose of use,
o Targeted populations,
o Targeted jobs or occupations,
o Mode of delivery,
o Time limits,
o Type of score,
o Availability of test reviews,
o Quality of technical reports and user manuals,
o Ease of use and interpretations,
o Costs associated with test and related materials, and
o Publishers of the seven selected GMA tests.

 Provide technical descriptions and comparison of the batteries in terms of

o Content definition methodologies,
o Procedures for construct and content validity (if reported),
o Number and types of items for each subtest of the GMA tests,
o Psychometric model (whether IRT or CTT),
o Methods used in equating alternate forms, norming, and standardization,
o Total score and subtest scores,
o Test scale and methods of scaling,
o Reliability and validity of tests, and
o Practical aspects of selected GMA tests in terms of:
 Security strategies to minimize fraud during test development,
production, management, maintenance, and administration,
 Item banking and item management,
 Designs of test (according to mode of delivery),
 Designs of test guides, manuals, supplementary materials, and technical
reports, and
 Services provided to test users.

 Evaluate the batteries with respect to

o Evaluation of tests according to professional standards (e.g., APA, NCME & AERA,
and SIOP),
o Popularity and use of the seven selected GMA tests,
o Compliance with the principles and professional standards,
o Suggestions for translations and adaptations,
o Suggestions and recommendations for MCS’s development of GCAT plan and
specifications, and

o A set of suggestions and recommendations with respect to important considerations in
management, staffing, security, item banking, technical support, administration, and the
design of test guides, manuals, and reports.

SECTION 4: METHODOLOGY OF STUDY 2
Study 2 used several tactics to gather the information (e.g., technical reports, interviews, and
alternate resources) necessary to fulfill the requirements of the study. The first phase of Study 2
involved the identification of GMA tests by SAC that had some initial basis for early consideration as
possible Study 2 tests. These early tests were identified based on SAC’s own experience with such
tests in the personnel selection domain, searches through sources of information about such tests,
and early input from Sabbarah. This early search effort was guided by broad, high level
requirements that the tests be GMA tests, produced by reputable developer/publishers, with some
history of use for personnel selection across a range of job families.

SAC originally identified 18 cognitive ability tests, which then received a more detailed evaluation
based on emerging guidelines and selection criteria, as well as input from Sabbarah. SAC provided a
short list of 11 batteries, with strong recommendations for three. From these 11, Sabbarah
identified the seven test batteries to be reviewed in Study 2. These are shown in Table 1 with their
subtests and publishers.

The criteria used for the final selection of batteries were developed during this initial battery
identification process. These criteria were based on a variety of considerations, including, primarily,
Sabbarah’s requirements of Study 2 and SAC’s own professional experience and judgment about
important test characteristics given the Study 2 objectives. As an example, one recommendation
criterion is availability of technical data. Other criteria, such as scope of cognitive abilities, became
better understood as communications with Sabbarah further clarified the likely model for the new civil
service exam in which a battery of several subtests would assess a range of cognitive abilities. Also,
Sabbarah suggested an important criterion relating to the level of test content. The majority of
candidates for Saudi Arabia civil service jobs are likely to have college degrees or some years of
college experience. Also, an important job family is professional/managerial work. As a result, Study 2
will be useful if it includes at least some tests developed at reading levels and/or content complexity
appropriate to college educated candidates and professional/managerial jobs. It was from these
criteria that the seven GMA tests were chosen for inclusion in Study 2. The criteria used to inform
SAC’s recommendations for GMA tests to be included in Study 2 are shown here.

A. Reputable Developer/Publisher
a. Does the developer/publisher have a strong professional reputation for well-
developed and documented assessment products?
B. Sabbarah/MCS Interest
a. Has Sabbarah expressed an interest in a particular test?
C. Scope of Cognitive Abilities
a. Do the cognitive abilities assessed cover a wide range?
D. Availability of Data
a. Is information about the test readily available, such as technical manuals,
administrative documents, research reports, and the like?
E. In Current Use
a. Is the test in current use?
F. Complementarity
a. Does the test assess cognitive abilities that complement the specific abilities
assessed by other included tests?
G. Relevance to MCS Job Families
a. Are the cognitive abilities assessed by the test relevant to the job families for which
MCS’s new civil service exams will be used?
H. Level of Test Content
a. Is the reading level or complexity of content at a level appropriate to
managerial/professional jobs? (Needed for some but not all included tests.)
I. Special Considerations
a. Are there special features/capabilities of the test that will provide significant value to
Sabbarah that would be absent without this test in Study 2?
J. Overall Recommendation
a. Does the test have a High Value, Good Value, Marginal Value, or No Value?
Table 1 describes the seven selected batteries reviewed in Study 2.

Table 1. The seven batteries included for review and evaluation in Study 2.
Differential Aptitude Test for Personnel and Career Assessment (DAT for PCA); Publisher: Pearson
1. Space Relations
2. Abstract Reasoning
3. Language Usage
4. Mechanical Reasoning
5. Numerical Ability
6. Verbal Reasoning

Employee Aptitude Survey (EAS); Publisher: PSI
1. EAS 1-Verbal Comprehension
2. EAS 2-Numerical Ability
3. EAS 3-Visual Pursuit
4. EAS 4-Visual Speed and Accuracy
5. EAS 5-Space Visualization
6. EAS 6-Numerical Reasoning
7. EAS 7-Verbal Reasoning
8. EAS 8-Word Fluency
9. EAS 9-Manual Speed and Accuracy
10. EAS 10-Symbolic Reasoning

General Aptitude Test Battery (GATB Forms E & F); Publisher: United States Employment Service (USES)
1. Name Comparison
2. Computation
3. Three-dimensional Space
4. Vocabulary
5. Object Matching
6. Arithmetic Reasoning
7. Mark Making
8. Place
9. Turn
10. Assemble
11. Disassemble

Professional Employment Test (PET); Publisher: PSI
1. Data Interpretation
2. Reasoning
3. Quantitative Problem Solving
4. Reading Comprehension

Power and Performance Measures (PPM); Publisher: Hogrefe
1. Applied Power
2. Mechanical Understanding
3. Numerical Computation
4. Numerical Reasoning
5. Perceptual Reasoning
6. Processing Speed
7. Spatial Ability
8. Verbal Comprehension
9. Verbal Reasoning

Verify; Publisher: SHL
1. Verbal Reasoning
2. Numerical Reasoning
3. Inductive Reasoning
4. Mechanical Comprehension
5. Checking
6. Calculation
7. Reading Comprehension
8. Spatial Ability
9. Deductive Reasoning

Watson-Glaser II; Publisher: Pearson
1. Recognize Assumptions
2. Evaluate Arguments
3. Draw Conclusions

After the seven batteries were selected, SAC gathered technical manuals and other relevant sources
of information related to each of the batteries. These sources included documents and information
such as technical reports, sample materials, administration materials, user manuals, interpretive
guides, and published research studies. In addition to these types of documents, SAC also
recognized the need to interview publishers’ technical experts to gather detailed information that is
generally not available in published material. This information included topics such as item/test
development, psychometric modeling, validity support, development of alternate forms, item banking
strategies, and item/test security. SAC was able to interview publishers’ experts
for all batteries except GATB and Verify. This was not a significant setback for GATB because far
more published information is available about GATB than about the other batteries. It was a modest
setback for Verify, however, because in our view Verify is in many respects the best model for GCAT.
Further, SHL declined to approve our effort to gather protected documentation about Verify such as a
current technical manual and technical information about their IRT-based bank management process.
SHL’s reason for not approving our access to protected information was that they were concerned
about disclosure of their intellectual property. The information gathered from multiple sources was
then distilled into Section 5, which presents the findings for each of the seven selected GMA
tests.

In addition to the publisher expert interviews, SAC also interviewed Principals at Sabbarah along with
the Study Reviewer. This particular interview provided significant clarification about the overall
development plans and use of GCAT for use as a selection instrument across a wide range of civil
service jobs in the Kingdom of Saudi Arabia. From this particular interview, the scope of the project
was slightly adjusted to better complement the proposed development and uses of
GCAT by MCS.

Section 5 summarizes all descriptive information that was gathered for each of the seven batteries
using the same organization of information for all batteries. This information informed Section 6 and
Section 7 of this Study 2 report. The primary focus of Section 6 was to identify key similarities and
differences among the reviewed batteries to identify those features that are likely to be most relevant
and least relevant to MCS’s effort to develop GCAT. Section 6 placed high importance on the issues
of test constructs and item content because these early decisions about GCAT will dictate many of the
subsequent technical considerations. Section 7 presents recommendations about decisions and
approaches to the development, validation, implementation and ongoing maintenance of the civil
service testing system. While the battery-specific information in Sections 5 and 6 provided significant
input into Section 7, the recommendations were also based on SAC’s professional experience and
other published information about large scale personnel selection programs.

SECTION 5: DESCRIPTIVE RESULTS FOR DAT for PCA


Overall Description and Uses

Introduction

The Differential Aptitude Test for Personnel and Career Assessment (DAT for PCA) is a cognitive
ability battery composed of eight subtests intended to measure four abilities.

The first edition of the DAT was originally published in 1947, and since then it has been widely used
for educational placement and career guidance of students. It has also been used extensively for
occupational assessments of adults and post-secondary students. The DAT for PCA was created
specifically for assessing adults who are applicants for employment, candidates for training and
development, career guidance, and adult education students. The DAT for PCA is an adapted short
form of the DAT Form V that uses the same test items to measure the same DAT aptitudes. Verbal
Reasoning was designed to measure the ability to understand concepts framed in words, to think
constructively, find commonalities among different concepts, and to manipulate ideas at an abstract
level. Numerical Ability was designed to measure an individual’s understanding of numerical
relationships and capability to handle numerical concepts. Abstract Reasoning was designed to
assess reasoning ability such as perceiving relationships that exist in abstract patterns. Mechanical
Reasoning was developed to measure knowledge of basic mechanical principles, tools, and motions.
Space Relations was designed to assess a person’s ability to visualize a three-dimensional object
from a two-dimensional pattern, and to visualize how the pattern would look if rotated. Spelling and
Language Usage was designed to measure how well individuals spell common English words. The

12
Clerical Speed and Accuracy subtest was designed to assess a person’s processing speed in simple
perceptual tasks.

Note: The DAT for PCA is a shortened form of the DAT Form V. Therefore, the DAT for PCA and
DAT may be used interchangeably throughout this document. However, we have just learned that
two subtests, Spelling and Clerical Speed and Accuracy, were recently removed from the DAT battery
due to a low volume of use. We have decided to continue to include these two subtests in the
remainder of this DAT report, as we feel they still provide examples of subtests relevant to
administrative/clerical jobs in Saudi Arabia’s civil service.

Purpose of Use

DAT for PCA was designed specifically for assessing adults who are applicants for employment,
candidates for training and development, career guidance, and adult education students. Other
versions of the DAT are also used for educational placement and career guidance of students. The
abilities that DAT was designed to assess are general reasoning abilities, mechanical operations and
principles, verbal achievement, and clerical speed abilities. These abilities have been seen to be
important for a broad spectrum of applications in personnel selection and career guidance. Each of
the subtests was developed separately from one another and intended to measure distinct abilities.
Much of the item content in the DAT tests is a mix of school-related topics, experience, verbal and
nonverbal analogical reasoning, nonverbal spatial ability, and perception. DAT for PCA was
developed to shorten the overall test length compared to DAT Form V and other earlier versions,
improve content appropriateness, and enhance the ease of local scoring.

Target Populations

The populations targeted by the DAT are adults who are applicants for employment, candidates for
training and development, career guidance, and adult education students. These adults may be
applying for or working in a wide range of occupations such as professional, managerial, clerical,
engineering, military, administrative, technical services, skilled trades, and even unskilled trades.
Normative information is presented in the technical manual and supporting norm documents. Norms
are for gender (i.e., males, females, and both sexes), and are presented in percentiles as well as
stanines.

Target Jobs / Occupations

The DAT was designed to be used for a wide range of occupations such as professional, managerial,
clerical, engineering, military, administrative, technical services, skilled trades, and even unskilled
trades. It may be tailored to specific occupations by administering subtests in various combinations.
The technical manual reports summary statistical information for several specific occupational groups.

Spread of Uses

The DAT was designed as test of cognitive abilities primarily for assessing adults who are applicants
for employment, candidates for training and development, career guidance, and adult education
students. However, other versions of the DAT have been used for different purposes and for other
populations. For example, one version of the DAT is primarily used with primary school students as
an instrument for career guidance. The DAT for PCA is designed solely for adult personnel selection
and vocational planning. It was designed to be used across a wide range of occupations. The norm
tables in the technical manual are related to gender, but other norm tables are available for student
grade levels (e.g., students in year 11, A Level students, and students with higher levels of education)
and for job families such as professional and managerial. The DAT is scored with number-correct,
raw score totals for each subtest. Total scores may be tailored to different combinations of subtests.

Administrative Details

Administrative detail is summarized in Table 2 and briefly described below.

Table 2. Administrative features of the DAT for PCA.

Subtest                        # Items   Time Limit
Space Relations                   35     15 minutes
Abstract Reasoning                30     15 minutes
Language Usage                    30     12 minutes
Mechanical Reasoning              45     20 minutes
Numerical Ability                 25     20 minutes
Verbal Reasoning                  30     20 minutes
Spelling                          55      6 minutes
Clerical Speed & Accuracy        100      6 minutes

Scoring rule (all subtests): number-correct raw score; percentile scores and stanines reported
based on selected norm groups.
Methods of delivery (all subtests): paper-pencil, proctored; online, proctored.

Time Limits

Compared to other versions of the DAT, the DAT for PCA subtests have fewer items and shorter time
limits. The developers shortened the battery to encourage and facilitate its use in adult assessment
applications. The shorter time limits and smaller numbers of items are the only differences; the
target constructs and item content remained the same. Note: this shortening did not apply to the
speeded Clerical Speed and Accuracy subtest, which was unchanged before it was removed from the battery.

Number and Type of Items


Table 2 shows the number of items for each of the eight subtests comprising the DAT for PCA,
ranging from 25 to 100 items. Much of the item content of the DAT tests is a mix of
school-related topics, experience, verbal and nonverbal analogical reasoning, nonverbal spatial
ability, and perception. Three of the subtests measure some form of reasoning (e.g., the Abstract
Reasoning subtest is nonverbal), three are based on verbal content, and one each measures numerical
ability, spatial ability, and clerical speed. The reading level is specified at no greater than grade six.
This allows the DAT for PCA cognitive ability battery to be used across a very wide range of
jobs, except higher-level jobs for which job complexity and applicant education levels may warrant
higher reading levels.

Type of Score

Number-correct subtest scores are added together to produce an overall total score, which is compared
to the total possible score. Raw scores can also be compared to norm group tables to produce
percentile scores and stanines. The DAT for PCA was designed specifically for local scoring by the
user; both hand scoring and Ready-Score self-scoring options are available.

Test Scale and Methods of Scaling

Norm group standards allow percentile and stanine scores to be produced from raw score totals for
each of the eight subtests and for total scores. Each of the versions of the DAT can be used by
tailoring combinations of subtests to specific occupations and jobs.
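
The raw-to-percentile-and-stanine conversion described above can be sketched as follows. The norm
data and the "percent scoring below" convention here are hypothetical illustrations; the publisher's
norm tables define the exact conventions used operationally.

```python
import numpy as np

def to_percentile(raw, norm_scores):
    """Percent of the norm group scoring below the raw score
    (one common convention; some norm tables use 'at or below')."""
    return 100.0 * np.mean(np.asarray(norm_scores) < raw)

def to_stanine(pct):
    """Map a percentile to a stanine using the standard bands:
    bottom 4%, then 7%, 12%, 17%, 20%, 17%, 12%, 7%, top 4%."""
    cuts = [4, 11, 23, 40, 60, 77, 89, 96]
    return 1 + sum(pct >= c for c in cuts)

# Hypothetical norm group of raw subtest scores
norm = [12, 14, 15, 17, 18, 20, 21, 23, 25, 28]
pct = to_percentile(20, norm)   # five of ten norm scores fall below 20
print(pct, to_stanine(pct))     # 50.0 5
```

A raw score at the median of the norm group lands at the 50th percentile and stanine 5, the center
of the stanine scale.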

Method of Delivery

The fixed forms of the DAT for PCA battery can be administered either by paper-pencil or by
computer and Pearson requires that both methods of delivery be proctored. The number of items,
types of items, and time limits are the same in both modes of administration. Scoring is the same for
both methods.

Cost

Table 3 shows the current prices published for the DAT for PCA administrative materials. No pricing
is published for online administration.

Table 3. Published prices for DAT for PCA administrative materials.


Item Unit Cost
Individual Test Booklets
Verbal Reasoning - Pkg. of 25 $190.00
Numerical Ability - Pkg. of 25 $190.00
Abstract Reasoning - Pkg. of 25 $190.00
Mechanical Reasoning - Pkg. of 25 $190.00
Space Relations - Pkg. of 25 $190.00
Language Usage - Pkg. of 25 $190.00
Answer Documents
Verbal Reasoning - Pkg. of 50 $147.00
Numerical Ability - Pkg. of 50 $147.00
Abstract Reasoning - Pkg. of 50 $147.00
Mechanical Reasoning - Pkg. of 50 $147.00
Space Relations - Pkg. of 50 $147.00
Language Usage - Pkg. of 50 $147.00
Directions for Administration and Scoring
Directions for Administration and Scoring - Pkg. of One $22.00
Scoring Key - Single Copy
Verbal Reasoning $96.00
Numerical Ability $96.00
Abstract Reasoning $96.00
Mechanical Reasoning $96.00
Space Relations $96.00
Language Usage $96.00
*No cost information is available for the Spelling or Clerical Speed and Accuracy subtests, as they
were recently removed from the battery.

Construct – Content Information

Intended Constructs

The subtests that comprise the DAT for PCA are typical of other multiaptitude cognitive ability
batteries. The DAT for PCA includes eight subtests which measure four major abilities. Three of the
subtests are designed to measure reasoning (e.g., the Abstract Reasoning subtest is nonverbal),
three are verbal, one is numerical, one measures spatial ability, and one measures clerical speed.
Like many other commonly used cognitive batteries, DAT subtests may be combined in a manner that is
tailored to specific occupations and jobs.

The DAT developer intended to design a multiaptitude cognitive ability battery to measure an
individual’s ability to learn or succeed in various work and school domains. The original DAT
developed in 1947 was used primarily for educational placement and career guidance for students in
grades eight through twelve. However, it was later adapted to be used with adults for personnel
selection and career assessment. In order to use it as such, the DAT for PCA was shortened (i.e.,
both number of items and time limits), the tests were repackaged, and it was enhanced for ease of
local scoring.

Table 4 below shows the eight subtests comprising the battery and the four abilities they are
intended to measure.

Table 4. Subtests of the DAT for PCA and the abilities they are intended to assess.

Subtest                        Ability
Verbal Reasoning               General Reasoning
Numerical Ability              General Reasoning
Abstract Reasoning             General Reasoning
Mechanical Reasoning           Mechanical Operations & Principles
Space Relations                Mechanical Operations & Principles
Spelling                       Verbal Achievement
Language Usage                 Verbal Achievement
Clerical Speed & Accuracy      Clerical Speed

Item Content in the Subtests

Item content for each subtest is described here. The reading level is specified at no higher than
grade six, allowing the battery to be used across a wide range of occupations.

A. Verbal Reasoning (VR)

a. Measures the ability to understand concepts framed in words, to think constructively, to
find commonalities among different concepts, and to manipulate ideas at an abstract
level. The items can test the examinee’s knowledge and also their ability to abstract
and generalize relationships from their knowledge.
i. The VR test consists of double-ended analogies in which both the first and last
terms are missing. The examinee must choose, from five alternative pairs, the one
pair that best completes the analogy. The content of the items can be varied
across multiple subject areas. Rather than focusing on vocabulary recognition, the
analogies focus on the ability to infer the relationship between the first pair of
words and apply that relationship to a second pair of words.
1. Pearson notes that VR may be expected to predict future success in
occupations such as business, law, education, journalism, and the
sciences.

B. Numerical Ability (NA)


a. Measures an individual’s understanding of numerical relationships and capability to
handle numerical concepts.
i. Designed to avoid the language elements of usual arithmetic reasoning problems,
in which reading ability may affect the outcome. As such, the items are intended
to be a pure measure of numerical ability. The examinee must perceive the
difference between numerical forms. To ensure that reasoning rather than
computational facility is stressed, the computational level of the items is below
the grade level of the examinees for whom the test is intended.
1. Pearson notes that NA may be an important predictor of success in
occupations such as mathematics, physics, chemistry, and
engineering. It may also be important for jobs such as bookkeeper,
statistician, and toolmaker.

C. Abstract Reasoning (AR)


a. Assesses reasoning ability, such as perceiving relationships that exist in abstract
patterns. It is a nonverbal measure of reasoning ability, assessing how well
examinees can reason with geometric figures or designs.
i. Each AR item requires the examinee to perceive the operating principles in a
series of changing diagrams. Examinees must discover the principles governing the
changes in the diagrams by designating which of the optional diagrams should
logically follow.
1. Pearson notes that AR may predict success in fields such as
mathematics, computer programming, drafting, and automobile
repair.

D. Mechanical Reasoning (MR)


a. Measures understanding of basic mechanical principles, tools, and motions.
i. Each MR item consists of a pictorially presented mechanical situation and a
simply worded question. The items represent simple mechanical principles that
involve reasoning.
1. Pearson notes that MR may be expected to predict future job
performance in occupations such as carpenter, mechanic,
maintenance person, and assembler.

E. Space Relations (SR)


a. Assesses a person’s ability to visualize a three-dimensional object from a two-
dimensional pattern, and to visualize how the pattern would look if rotated.
i. Each SR test item presents one pattern, which is followed by four three-
dimensional figures. The participant must choose the one figure that can be
created from the pattern. The test patterns are purported to be large and
clear. Basically, the task is to judge how the objects would look if constructed
and rotated.
1. Pearson notes that SR may be predictive of success in occupations
such as drafting, clothing design, architecture, art, decorating,
carpentry, and dentistry.

F. Spelling (SP)
a. Measures how well individuals can spell common English words.
i. The examinee is presented with a list of words and must determine which of those
words are correctly spelled and which are misspelled. The misspelled words are
common and plausible spelling errors.
1. Pearson notes that SP may be predictive of future job performance in
occupations such as stenography, journalism, and advertising,
among nearly all occupations where use of the English language is a
necessity.
2. No sample item available.

G. Language Usage (LU)


a. Items are representative of present-day formal writing. Examinees must distinguish
between correct English language usage and incorrect usage.
i. Measures the ability to detect errors in grammar, punctuation, and
capitalization.
1. Pearson notes that LU may be predictive of future job performance in
occupations such as stenography, journalism, and advertising,
among nearly all occupations where use of the English language is a
necessity.

H. Clerical Speed and Accuracy (CSA)

a. Assesses a person’s response speed in simple perceptual tasks.

i. Examinees must first note the combination that is marked in the test booklet,
keep it in mind while searching for the same combination in a group of similar
combinations on an answer sheet, and then select the correct combination.
1. Pearson notes that CSA may be important in occupations such as
filing, coding, stock room work, and other technical and scientific data
roles.
2. No sample item available.

Combinations of Subtests

The technical manual reports several composites that can be formed by combining subsets of the
eight subtests to measure specific abilities. These are:

A. VR + NA
a. Measures functions associated with general cognitive ability
i. This combination of subtests reflects an ability to learn in an occupational or
scholastic environment, especially from manuals, trainers, teachers, and
mentors.

B. MR + SR + AR
a. Measures components of perceptual ability. This may be seen as the ability to
visualize concrete objects and manipulate the visualizations, to recognize everyday
physical forces and principles, and to reason and learn through a nonverbal medium.
i. Abilities of this type are important in dealing with things, rather than people
or words, and could be used in skilled trade occupations.

C. CSA + SP + LU
a. Abilities that represent a set of skills which may be necessary for a wide range of
office work.
i. These abilities may be useful for jobs relating to clerical or secretary
positions.
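
Each of these composites is formed by simple addition of number-correct subtest scores. A minimal
sketch, with hypothetical examinee scores, is:

```python
# Composites reported in the technical manual, formed by summing
# number-correct raw scores on the constituent subtests.
COMPOSITES = {
    "general_cognitive": ["VR", "NA"],
    "perceptual": ["MR", "SR", "AR"],
    "clerical": ["CSA", "SP", "LU"],
}

def composite_scores(raw):
    """raw: dict mapping subtest abbreviation -> number-correct score."""
    return {name: sum(raw[s] for s in parts)
            for name, parts in COMPOSITES.items()}

# Hypothetical examinee
raw = {"VR": 20, "NA": 15, "AR": 22, "MR": 30, "SR": 25,
       "SP": 40, "LU": 18, "CSA": 80}
print(composite_scores(raw))
# {'general_cognitive': 35, 'perceptual': 77, 'clerical': 138}
```

Because each composite is a raw-score sum, composites would still need to be referred to composite
norm tables before percentile or stanine interpretations are made.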

Construct Validity Evidence

The publisher reported correlations between the DAT for PCA subtests and subtests included in
other commercially available cognitive ability batteries such as the General Aptitude Test Battery
(GATB) and the Armed Services Vocational Aptitude Battery (ASVAB), which were at one time the
two most widely used multiple aptitude test batteries in the US. The GATB had been widely used in
personnel assessment and selection by state employment services. The pattern of correlations in
Table 5 below provides support for the DAT: (1) the DAT battery is highly related to the six GATB
cognitive factors, (2) all of the DAT subtests, except for Clerical Speed and Accuracy, were highly
related to the GATB’s general intelligence factor, (3) each of the DAT subtests has its highest
correlation with the corresponding GATB factor, and (4) the DAT’s Clerical Speed and Accuracy
subtest correlated relatively highly with the GATB perceptual and motor tests.

Table 5. Construct validity correlations between DAT and GATB subtests.

        VR   NA  VR+NA   AR  CSA   MR   SR   SP   LU    G    V    N    S    P
NA     .73
VR+NA    -    -
AR     .63  .70  .71
CSA    .41  .44  .46   .35
MR     .68  .63  .71   .69  .27
SR     .68  .67  .72   .70  .38  .72
SP     .68  .57  .68   .44  .40  .35  .40
LU     .81  .66  .80   .51  .44  .55  .53  .75
G      .78  .72  .81   .64  .48  .62  .64  .70  .80
V      .76  .64  .76   .58  .46  .57  .55  .68  .81  .94
N      .52  .62  .61   .43  .48  .29  .33  .64  .58  .66  .54
S      .53  .58  .59   .63  .42  .58  .68  .40  .45  .70  .57  .41
P      .19  .23  .22   .24  .36  .19  .27  .24  .19  .37  .28  .35  .49
Q      .40  .42  .44   .39  .61  .21  .29  .50  .39  .57  .51  .62  .46  .49

The DAT was also correlated with the ASVAB, which has been used in personnel selection, high
school career counseling, and two major Department of Defense programs. The ASVAB Form 14
contains 10 subtests: (a) General Science, (b) Arithmetic Reasoning, (c) Word Knowledge, (d)
Paragraph Comprehension, (e) Numerical Operations, (f) Coding Speed, (g) Automotive and Shop
Information, (h) Mathematics Knowledge, (i) Mechanical Comprehension, and (j) Electronics
Information. Eight of the tests are power tests, and the other two are speeded tests. This study found
that (1) overall, the DAT is highly correlated with the ASVAB, (2) except for the Clerical Speed and
Accuracy DAT subtest, all of the DAT subtests were moderately to highly correlated with all of the
ASVAB’s power tests, (3) DAT’s Clerical Speed and Accuracy test was only correlated with the
ASVAB’s speeded tests, and (4) the DAT composite of Verbal Reasoning and Numerical Ability was
highly related to ASVAB’s school-related tests. Table 6 provides the results of the study.

Table 6. Construct validity correlations between DAT and ASVAB subtests.
VR NA VR+NA AR CSA MR SR SP LU GS ARITH WK PC NO CS AS MK MC

NA .75

VR+NA - -

AR .69 .75 -

CSA .11 .22 - .17

MR .62 .55 - .58 .08

SR .66 .62 - .61 .09 .71

SP .56 .56 - .46 .19 .50 .48

LU .76 .67 - .62 .11 .64 .64 .73

GS .72 .64 .73 .58 .05 .66 .61 .53 .68

ARITH .75 .79 .82 .65 .10 .62 .66 .54 .67 .72

WK .78 .67 .78 .62 .04 .63 .59 .60 .76 .82 .73

PC .72 .66 .74 .62 .07 .60 .59 .57 .72 .72 .71 .80

NO .23 .41 .33 .30 .35 .21 .27 .36 .28 .27 .35 .28 .29

CS .22 .35 .30 .28 .42 .12 .19 .36 .26 .16 .26 .20 .26 .58

AS .47 .40 .47 .39 -.03 .63 .50 .27 .39 .60 .52 .57 .49 .15 .04

MK .73 .78 .80 .66 .13 .58 .67 .54 .68 .67 .80 .68 .69 .34 .27 .45

MC .61 .57 .63 .57 .03 .73 .66 .39 .55 .66 .65 .63 .62 .21 .14 .66 .63

EI .48 .42 .49 .40 -.01 .59 .50 .35 .48 .59 .53 .57 .52 .14 .07 .66 .50 .68

Item and Test Development

Item Development

The DAT for PCA is a shortened form of the DAT Form V, with fewer items and shorter time limits,
except for the Clerical Speed and Accuracy test, which remained the same before it was removed from
DAT for PCA. In order to shorten the DAT for PCA, the developer began with a theoretical analysis
of the psychometric effect on reliability of shortening tests. To estimate the reliability of the
shorter tests comprising the DAT for PCA, the Spearman-Brown formula was applied to the known
internal reliability coefficients for DAT Form V. From these values Pearson was able to determine
how much the original DAT tests could be shortened while retaining acceptable reliability coefficients.
Because DAT Form V is among the most reliable of all commercially available cognitive ability
selection batteries we are aware of, Pearson found that the DAT Form V tests could be shortened
significantly without causing inadequate reliability. Table 7 reports these reliabilities for DAT Form V
and DAT for PCA.

Table 7. Observed reliability coefficients for DAT Form V and estimated reliability coefficients for DAT
for PCA.
DAT Subtest Form V (Observed) DAT for PCA (Estimated)
Verbal Reasoning .94 .90
Numerical Ability .92 .88
Abstract Reasoning .94 .90
Mechanical Reasoning .94 .90
Space Relations .95 .92
Spelling .96 .93
Language Usage .92 .87
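
The Spearman-Brown step described above is straightforward to reproduce. In this sketch the length
ratio k is hypothetical, since the exact Form V and PCA item counts are not restated here; the
point is only that a ratio in this neighborhood reproduces the Table 7 estimates.

```python
def spearman_brown(reliability, k):
    """Predicted reliability when a test's length is changed by factor k
    (k = new length / old length; k < 1 shortens the test)."""
    return k * reliability / (1 + (k - 1) * reliability)

# Verbal Reasoning: Form V observed reliability .94 (Table 7).
# With a hypothetical length ratio of 0.6, the predicted short-form
# reliability is about .90, matching the DAT for PCA estimate reported.
print(round(spearman_brown(0.94, 0.6), 2))  # 0.9
```

Because the Form V coefficients start so high, even substantial shortening leaves the predicted
reliabilities in the high .80s to low .90s, which is the pattern Table 7 shows.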

The developer’s intent was to create items that would produce an overall test difficulty level
appropriate for adults and students in post-secondary training programs. “Target difficulty” of test
items shown in Table 8 is the average proportion of correct responses to a test’s items in the intended
population, adjusted for the number of response alternatives.

Table 8. Target difficulty for DAT for PCA subtests.


DAT Subtest Response Alternatives Target Difficulty
Verbal Reasoning 5 .60
Numerical Ability 5 .60
Abstract Reasoning 5 .60
Mechanical Reasoning 3 .67
Space Relations 4 .62
Spelling 2 .75
Language Usage 5 .60
*Response alternatives are the number of alternatives from which examinees may choose on each item.

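
The adjustment behind Table 8 appears to place the target difficulty halfway between the chance
rate (1/m for an m-alternative item) and a perfect score; this is a reconstruction from the table
values rather than a formula stated in the manual, but it matches every row to two decimals.

```python
def target_difficulty(m):
    """Midpoint between the chance rate (1/m) and 1.0 for an
    m-alternative item -- the adjustment implied by Table 8."""
    chance = 1.0 / m
    return (chance + 1.0) / 2.0

for m in (5, 4, 3, 2):
    print(m, round(target_difficulty(m), 2))
# 5 -> 0.6, 4 -> 0.62, 3 -> 0.67, 2 -> 0.75, matching Table 8
```

This convention keeps the expected information per item roughly comparable across subtests that
differ in their number of response alternatives.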
Items were also adapted for the DAT for PCA in order to enhance cultural transparency. Some of the
previous items in the DAT were found to not be appropriate in other English-speaking countries. This
may have been due to language differences, spelling conventions, systems of measurement, and
cultural references. Care was also taken to avoid increasing gender differences that were known in
other versions of the DAT.

The developers used two sets of data to assemble the DAT for PCA: an initial tryout sample and a
subsequent pilot sample. The first sample was used to select items into the DAT for PCA, while the
second sample was used for subsequent item analysis and test statistics of the DAT for PCA
subtests. The first data set consisted of item analysis statistics (i.e., item difficulty and
discriminating power) for students in Grade 12 of the 1982 Form V standardization sample (data
available for males and females, separately and combined; sample size not reported). The second sample
consisted of a spaced sample of 1,512 males in the standardization sample, spanning Grades 10
through 12. They used these data sets to select DAT Form V items for use in the DAT for PCA, and
also to evaluate the psychometric properties of the shortened DAT for PCA compared to the DAT
Form V. Table 9 shows the internal consistency estimates provided by the second data set.

Table 9. Internal consistency estimates for the DAT for PCA subtests.
Subtest KR20 M SD SEM
Verbal Reasoning .91 15.7 7.7 2.3
Numerical Ability .88 13.9 6.0 2.1
Abstract Reasoning .91 20.0 7.2 2.2
Mechanical Reasoning .91 32.5 8.5 2.6
Space Relations .93 22.0 9.0 2.4
Spelling .94 38.8 11.0 2.9
Language Usage .89 17.0 7.0 2.3
*Clerical Speed and Accuracy is not reported because it is a speeded test. As such, internal
consistency measures of reliability are not appropriate.
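
The SEM column of Table 9 follows from the usual relation SEM = SD × sqrt(1 − reliability); a quick
consistency check against two rows of the table:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement from a score SD and a reliability coefficient."""
    return sd * math.sqrt(1.0 - reliability)

# Check two rows of Table 9
print(round(sem(7.7, 0.91), 1))  # 2.3  (Verbal Reasoning)
print(round(sem(6.0, 0.88), 1))  # 2.1  (Numerical Ability)
```

Each reported SEM reproduces to one decimal, which suggests the table was computed with exactly
this formula from the KR20 and SD columns.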

Psychometric Model

The DAT for PCA does not appear to employ IRT for item and test development or item management.
Also, there is no item bank for this cognitive ability battery.

Multiple Forms

DAT for PCA does not have multiple forms. It was derived directly from the DAT Form V in which the
content specifications were unchanged and the DAT for PCA subtests were composed of items
directly from the DAT Form V. Therefore, it can be considered that the DAT for PCA and DAT Form V
are measuring the same aptitudes. In order to determine the equivalency of the two versions,
correlation coefficients were computed between raw scores on each test as a type of alternative forms
reliability estimate. Clerical Speed and Accuracy did not have alternative forms reliability computed
as it remained the same. The data source for the analysis was the Grades 10 through 12 males’
item response data from the 1982 standardization of DAT Form V. Each DAT test was scored twice:
first, the Form V raw score was computed, and then the DAT for PCA raw score was computed by
ignoring the responses to the Form V items not used in the new DAT for PCA test.

The correlations are part-whole correlations because the constituent test items of the DAT for PCA
are completely contained in the longer Form V. They cannot be interpreted as alternative form
correlations because part-whole correlations are known to overstate the relationship between
independently measured variables. But they can be used to support the equivalence of a short form
with a long form. The results in Table 10 show these part-whole correlations for the DAT Form V and
DAT for PCA. The correlations are very high because the DAT for PCA contains the same items as the
DAT Form V.

Table 10. Part-whole correlations of DAT Form V and DAT for PCA.
DAT Subtest rxx
Verbal Reasoning .98
Numerical Ability .97
Abstract Reasoning .98
Mechanical Reasoning .98
Space Relations .96
Spelling .97
Language Usage .97
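
The double-scoring procedure described above can be sketched with simulated item responses. The
lengths and the independent 0/1 items here are hypothetical; real subtest items are positively
correlated, which pushes part-whole correlations above this simulation toward the .96-.98 range of
Table 10.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examinees, n_items, n_kept = 500, 40, 30   # hypothetical lengths

# Simulated 0/1 item responses (independent items, for illustration only)
responses = rng.integers(0, 2, size=(n_examinees, n_items))

whole = responses.sum(axis=1)              # long-form score: all items
part = responses[:, :n_kept].sum(axis=1)   # short-form score: kept items only

r = np.corrcoef(whole, part)[0, 1]
# With independent items the expected part-whole correlation is
# sqrt(n_kept / n_items) = sqrt(30/40), about .87.
print(round(r, 2))
```

This also illustrates why part-whole correlations overstate alternative-forms reliability: the part
shares its items with the whole, so the correlation is high even when items are unrelated.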

Item Banking Approaches

The DAT Form V tests were used as item banks for selection of the DAT for PCA items during
development. However, the DAT for PCA is used as a single fixed-form test, requiring proctored
administration in all cases. Pearson has not developed a bank of items to support DAT for PCA and
does not use any methods of randomized or adaptive test construction in the online version of DAT
for PCA.

Approach to Item / Test Security


Test Security

The information gained from Pearson in regard to test security centered on steps taken to make
sure that test content is not jeopardized within their proctored administration process. Pearson uses
web patrols to monitor internet activity and other resources for any indication that test items or answer
keys have been acquired and distributed or sold. Also, they require test users to agree that the DAT for
PCA will be proctored by a trained, qualified test administrator. Lastly, Pearson is prepared to
develop new equivalent items to ensure item currency and security, or to create customized forms as
clients might require.
Criterion Validity Evidence

Criterion Validity

Table 11 summarizes criterion validity results for the DAT for PCA. The results support the
conclusion that DAT for PCA total test scores predict future job performance for a number of
occupations.

Table 11. Criterion validity evidence in various occupations with the DAT for PCA.

Criterion: Manager ranking of mechanical performance
  Skilled tradesmen (N = 87):         VR .39, NA .25, MR .31
  Electricians, Mechanics (N = 66):   VR .43, NA .29, MR .36, SR .18
  Mechanics (N = 40):                 VR .35, NA .29, MR .35, SR .27
  Electricians (N = 26):              VR .57, NA .36, MR .39, SR .05
  Pipefitters (N = 21):               SR .32, AR .41

Criterion: Manager ranking of ability to take initiative
  Skilled tradesmen (N = 87):         VR .37
  Electricians, Mechanics (N = 66):   SR .18
  Mechanics (N = 40):                 SR .27
  Electricians (N = 26):              SR .05
  Pipefitters (N = 21):               SR .32, AR .53

Criterion: Manager ranking of communication
  Pipefitters (N = 21):               AR .47

Criterion: Manager ranking of planning and organization
  Pipefitters (N = 21):               AR .47

Criterion: Manager ranking of problem identification skill
  Skilled tradesmen (N = 87):         VR .37, NA .30, MR .35, SR .24
  Electricians, Mechanics (N = 66):   VR .38, NA .35, MR .34, SR .17
  Mechanics (N = 40):                 VR .34, NA .32, MR .37, SR .00
  Electricians (N = 26):              VR .44, NA .45, MR .31, SR .46
  Pipefitters (N = 21):               VR .32, NA .51

Criterion: Manager ranking of problem resolution skill
  Skilled tradesmen (N = 87):         VR .28, NA .31, MR .27
  Electricians, Mechanics (N = 66):   VR .33, NA .35, MR .29
  Mechanics (N = 40):                 VR .31, NA .40, MR .27
  Electricians (N = 26):              VR .39, NA .34, MR .34
  Pipefitters (N = 21):               AR .39

Criterion: Manager ranking of overall performance
  Skilled tradesmen (N = 87):         VR .33, NA .31, MR .33, SR .25
  Electricians, Mechanics (N = 66):   VR .38, NA .35, MR .39, SR .22
  Mechanics (N = 40):                 VR .37, NA .40, MR .39, SR .19
  Electricians (N = 26):              VR .40, NA .34, MR .39, SR .32
  Pipefitters (N = 21):               SR .24, AR .44

Criterion: Manager ranking of overall performance potential
  Skilled tradesmen (N = 87):         VR .33, NA .30, MR .40, SR .29
  Electricians, Mechanics (N = 66):   VR .36, NA .33, MR .44, SR .24
  Mechanics (N = 40):                 VR .28, NA .34, MR .43, SR .12
  Electricians (N = 40):              VR .51, NA .42, MR .49, SR .52
  Pipefitters (N = 21):               SR .37, AR .49

Sample composition (all groups): average age 43 years, 20 years company tenure, 12.5 years in
current job; males; 83 White, 3 Black, 1 Hispanic.
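
Coefficients like those in Table 11 are bivariate correlations between subtest scores and manager
rankings. The data below are hypothetical; note that because a ranking of 1 means "best," the
ranking is inverted before correlating (rank criteria are also often analyzed with rank-order
correlations such as Spearman's rho).

```python
import numpy as np

# Hypothetical subtest scores and manager rankings for one small group
scores = np.array([22, 30, 18, 27, 25, 35, 20, 29])
rankings = np.array([5, 2, 8, 3, 6, 1, 7, 4])   # 1 = best performer

# Invert the ranking so that higher = better, then correlate with scores
r = np.corrcoef(scores, -rankings)[0, 1]
print(round(r, 2))
```

With samples as small as those in Table 11 (N = 21 to 87), individual coefficients carry wide
confidence intervals, which is one reason the report interprets the overall pattern rather than any
single value.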

Translations / Adaptions

No translations are available. The DAT for PCA is available only in English.

User Support Resources

Pearson provides resources for registered and qualified users. Resources supporting DAT products
include:

 Technical Manuals
 User Guides
 Information Overviews
 Sample Report

Evaluative Reviews

Twelfth Mental Measurements Yearbook

The reviews are generally positive regarding the DAT for PCA, especially praising its longevity
and wide range of uses. It is noted that there is substantial criterion validity for its use as a
general screening instrument for employment purposes. The authors regard it as an excellent test for
personnel and educational assessment. As noted previously, it is also continuously being revised and
updated, which enhances its future use. There are some weaknesses around test bias and normative
information which, the reviewers say, could be addressed with further work (Willson & Wang, 1995).
While normative and validity evidence exists, the authors of the review suspect that test users could
find more useful information if the publisher were to continue such studies. All in all, the authors
support future use of the DAT for both personnel selection and vocational guidance.

The DAT is also reviewed and reported to have strong reliability and validity evidence, and is
purported to be one of the most frequently used cognitive ability batteries (Wang, 1993). Wang (1993)
states that it has very high quality, credibility, and utility, which make it a well-founded battery.
Also, new items have only added to its improvements and maintained its psychometric qualities.

DESCRIPTIVE RESULTS FOR EAS

Overall Description and Uses

Introduction

PSI's Employee Aptitude Survey (EAS) consists of 10 subtests and is intended to measure eight
cognitive abilities.

The EAS was originally published in 1963, and since then it has been used extensively for personnel
selection and career guidance applications. Since its development, it has maintained the 10 original
subtests which have been shown to be both reliable and valid predictors of future performance in
many different occupations. These abilities as measured by the EAS have been found to be
important for a wide variety of jobs. The Verbal Comprehension subtest was designed to assess the
ability to understand written words and the ideas associated with them. The Numerical Ability subtest
assesses the ability to add, subtract, multiply, and divide integers, decimals, and fractions. Visual
Pursuit was created with the intent of measuring a person's ability to make rapid, accurate scanning
movements with the eyes. Visual Speed and Accuracy measures the ability to compare numbers or
patterns quickly and accurately. Space Visualization assesses the ability to imagine objects in three-
dimensional space and to manipulate objects mentally. Numerical Reasoning was created with the
intent to measure a participant's ability to analyze logical numerical relationships and to discover

underlying principles. Verbal Reasoning measures the ability to combine separate pieces of
information and to form conclusions on the basis of that information. Word Fluency was designed to
assess an individual's ability to generate a number of words quickly without regard to meaning.
Manual Speed and Accuracy measures the ability to make repetitive, fine finger movements rapidly
and accurately. Symbolic Reasoning assesses the ability to apply general rules to specific problems
and to derive logical answers.

Purpose of Use

The Employee Aptitude Survey (EAS) was designed as a personnel selection and career guidance
instrument. In addition to selection and career guidance, the EAS can also be used for placement,
promotions, and training and development. The abilities that the EAS was designed to assess are
verbal comprehension, numbers, word fluency, reasoning, mechanical reasoning, space visualization,
syntactic evaluation, and pursuit. Each of these abilities is seen as important for a wide ranging
variety of jobs such as professional/managerial/supervisory, clerical, production/mechanical (skilled
and semi-skilled), technical, sales, unskilled, protective services, and health professionals. Each of
the subtests was developed separately and by multiple authors. Much of the content of the items
appears to be work neutral. PSI has taken care to make sure that the EAS is easy to administer,
score, and interpret.

Target Populations

The populations targeted for the EAS are candidates who are at least 16 years old or working
adults in wide-ranging jobs and occupations such as professional/managerial/supervisory, clerical,
production/mechanical (skilled and semi-skilled), technical, sales, unskilled, protective services, and
health professionals. Normative information for the EAS is presented in the EAS Examiner’s Manual,
with additional norm tables provided in the EAS Norms Report. A total of 85 norm tables, 65 job
classifications, and 17 general or educational categories are available.

Target Jobs/Occupations

As previously mentioned, the EAS is not intended for just a few specific occupations, but instead
for a wide-ranging field such as professional/managerial/supervisory, clerical,
production/mechanical (skilled and semi-skilled), technical, sales, unskilled, protective services, and
health professionals. As such, the EAS can be tailored using numerous combinations of the subtests
to best suit the needs and requirements of the target occupation. Examples of occupations for which
the EAS can be used in selection:

A. Professional, Managerial, and Supervisory


a. Jobs specializing in fields such as engineering, accounting, personnel relations, and
management. These are jobs that most likely require college or university level
education.
i. Examples of tasks would be scheduling, assigning, monitoring, and
coordinating work.
B. Clerical
a. Jobs that are administrative in nature.
i. Examples of tasks are preparing, checking, modifying, compiling, and
maintaining documents/files. Other tasks may be coding and entering data.
C. Production/Mechanical (Skilled and Semi-skilled)
a. Jobs that require specific and standardized procedures.
i. Examples of tasks are operating, monitoring, inspecting, troubleshooting,
repairing, and installing equipment. These jobs may also involve
calculations, computer operation, and quality control activities.
D. Technical

a. Jobs that specialize in fields such as engineering, information management systems,
applied sciences, and computer programming. These are jobs which normally
require junior college or technical education.
i. Examples of tasks are programming and computer operations.
E. Sales
a. Jobs that involve product marketing or selling of services.
i. An example of a task is demonstrating products to potential clients.
F. Unskilled
a. Jobs requiring no advanced or technical education. Usually involve the performance
of simple, routine, and repetitive tasks in structured environments.
G. Protective Services
a. Jobs focusing on promoting the health, safety, and welfare of the public. Examples
include police officer, fire fighter, and security guard.
H. Health Professional
a. Jobs focusing on medical, dental, psychological, and other health services. These
jobs usually require specialized training and education from colleges or universities.

Spread of Uses

The EAS was designed as a test of cognitive abilities primarily for personnel selection and career
guidance. However, it has also been used for placement, promotion, and training and development. It
was designed for individuals 16 years of age or older and for working adults in a wide range of
occupations; it is not intended solely for specific jobs or occupations. Examples of target occupations
have been described in the section above. For this range of applications, PSI has assembled 85 norm
tables, and for many occupations the technical manual provides total score and percentile norms.

Administrative Details
Administrative detail is summarized in Table 12 and briefly described below.
Table 12. Administrative features of the EAS subtests.

Subtest                            # Items    Time Limit
EAS 1-Verbal Comprehension            30      5 minutes
EAS 2-Numerical Ability               75      2-10 minutes
EAS 3-Visual Pursuit                  30      5 minutes
EAS 4-Visual Speed and Accuracy      150      5 minutes
EAS 5-Space Visualization             50      5 minutes
EAS 6-Numerical Reasoning             20      5 minutes
EAS 7-Verbal Reasoning                30      5 minutes
EAS 8-Word Fluency                    75      5 minutes
EAS 9-Manual Speed and Accuracy      750      5 minutes
EAS 10-Symbolic Reasoning             30      5 minutes

Scoring rule (all subtests): total score, with the ability to link to a percentile score.
Methods of delivery (all subtests): paper-pencil, proctored; online (ATLAS), proctored and
unproctored.

Time Limits

A concept that guided the development of the EAS was maximum validity per minute of testing time,
as reported in the technical manual. While several of the subtests are composed of more than 50
items (i.e., EAS 2-Numerical Ability, EAS 4-Visual Speed and Accuracy, EAS 8-Word Fluency, and
EAS 9-Manual Speed and Accuracy), each of the subtests is limited to five minutes. The exception is
the EAS 2-Numerical Ability, which has a 2 to 10 minute time limit depending on the items used.
Overall, the EAS subtests have among the shortest time limits of any commercially available cognitive
battery used for personnel selection. Testing time was a critical consideration when PSI developed
the EAS.

Number and Types of Items

Table 12 shows the number of items for each of the 10 subtests, ranging from 20 to 750 items. None
of the 10 subtests appears to have work-related item content; the item content is work-neutral. Three
of the subtests are related to reasoning, two are numerical, two are verbal, and three are related to
visual abilities. Each subtest is composed of multiple choice items that measure the previously
discussed abilities. The required reading level is not specified for any of the 10 subtests, but it is
assumed to be appropriate for individuals who are at least 16 years old and work in unskilled
occupations.

Type of Score

Scores on each of the subtests are computed as number-correct raw scores that are transformed to
percentile scores based on relevant norm groups. The EAS may be scored three different ways: (a)
computer automated scoring (using PSI's ATLAS™ platform), (b) hand scoring using templates, and
(c) on-site optical scanning/scoring. The raw score for each subtest is the total number of items
answered correctly; each raw score is converted to a percentile score based on relevant PSI norm
groups. Scores can also be banded, treated as pass/fail, or used to rank examinees.

Each of the subtests can be either hand scored or machine scored from scannable test forms. For
hand scoring, keys are provided for scoring both right and wrong responses. Scannable test forms for
machine scoring are available for the eight tests that lend themselves to optical scanning (EAS 1
through 7 and EAS 10).
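The norm-referenced scoring just described can be sketched in a few lines. This is an illustrative sketch, not PSI's scoring code; the norm-table score bands and percentile values below are invented for the example.

```python
# Hypothetical sketch of EAS-style norm-referenced scoring: a raw
# number-correct score is converted to a percentile via a norm table.
# The bands below are invented; real EAS norms come from PSI's
# Examiner's Manual and Norms Report.
from bisect import bisect_right

# Norm table for one subtest: (highest raw score in band, percentile).
NORM_TABLE = [(9, 10), (14, 25), (19, 50), (24, 75), (30, 95)]

def raw_to_percentile(raw_score, norm_table):
    """Return the percentile for the band containing raw_score."""
    uppers = [upper for upper, _ in norm_table]
    idx = min(bisect_right(uppers, raw_score - 1), len(norm_table) - 1)
    return norm_table[idx][1]
```

A banded or pass/fail decision rule could then be layered on top of the percentile, as the report notes.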

Test Scale and Methods of Scaling

The score reported for each of the EAS subtests is a number-correct raw score. Percentile scores
are provided based on the norm groups. The EAS may be tailored to a target job by combining 3 or 4
of the subtests into a job-specific composite. Scores from the chosen subtests can be combined
using SD-based weights, unit weights, or other rationally or statistically determined weights. Although
items are not differentially weighted, they are ordered from easy to difficult on each form of the EAS.
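The composite-weighting options just described can be illustrated with a short sketch. This is a hypothetical example, not PSI's procedure: the subtest means and SDs are taken from the summary statistics reported in Table 23, and the weights are arbitrary choices for illustration.

```python
# Illustrative sketch of combining subtest raw scores into a
# job-specific composite. Standardizing by each subtest's norm-group
# SD before weighting keeps a subtest with a large raw-score range
# (e.g., 150 items) from dominating the composite.
NORMS = {"EAS1": (19.2, 6.38), "EAS2": (45.9, 13.70), "EAS7": (14.3, 6.19)}
WEIGHTS = {"EAS1": 1.0, "EAS2": 1.0, "EAS7": 2.0}  # rationally assigned

def composite(raw_scores):
    """Weighted sum of z-scores for the chosen subtests."""
    total = 0.0
    for subtest, raw in raw_scores.items():
        mean, sd = NORMS[subtest]
        total += WEIGHTS[subtest] * (raw - mean) / sd
    return total
```

With unit weights on z-scores, every subtest contributes equally regardless of its raw-score scale; SD-based or statistically derived weights simply change the multipliers.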

Method of Delivery

The EAS may be delivered by proctored paper-pencil administration as well as online administration.
For online use, the EAS is administered through PSI's own web-delivery platform, ATLAS™, an
enterprise web-based Software-as-a-Service (SaaS) solution for managing selection and assessment
processes. The platform supports a number of functions (e.g., configuring tests, batteries, and score
reports; managing test inventory; managing and deploying proctored and unproctored tests; and
establishing candidate workflow with or without applicant tracking systems). An important aspect of
administering the EAS through ATLAS™ is that administration can be unproctored. The EAS may also
be administered individually or in a group. For online unproctored use, PSI has created a form
designed specifically for unproctored administration. This form can also be tailored to client needs, in
which case items may be randomized.

Costs

Test Resource                                              Cost

EAS Subtests Separately (pkg. 25; includes scoring sheet)
EAS 1-Verbal Comprehension $112.50
EAS 2-Numerical Ability $112.50
EAS 3-Visual Pursuit $112.50
EAS 4-Visual Speed and Accuracy $112.50
EAS 5-Space Visualization $112.50
EAS 6-Numerical Reasoning $112.50
EAS 7-Verbal Reasoning $112.50
EAS 8-Word Fluency $112.50
EAS 9-Manual Speed and Accuracy $112.50
EAS 10-Symbolic Reasoning $112.50
Manuals
EAS Technical Manual $350.00
*Other cost information was not available despite extensive efforts to obtain it.

Construct-Content Information

Intended Constructs

The types of subtests in the EAS are similar to other commercially available tests of cognitive ability.
The battery comprises 10 subtests designed to measure eight abilities. As shown in Table 13, three
of the subtests are related to reasoning, two are numerical, two are verbal, one measures word
fluency, three are related to visual abilities, and one measures manual speed (several subtests tap
more than one ability). The EAS can be tailored to specific jobs and occupations by administering
various combinations of the subtests; intercorrelations computed between the EAS subtests indicate
that specific subtests are best suited to particular jobs rather than to all jobs in general. Each of these
abilities has been shown to be predictive of future job performance across a wide range of jobs.

The developers of the EAS intended to develop a cognitive ability battery that would have the
maximum validity per minute of testing time. As such, they created and combined multiple subtests of
short time limits, each of which measures a specific ability that may be regarded as relevant to certain
types of job tasks/activities. From numerous interviews with human resource managers, they
identified three important considerations that had not been met by other personnel selection tests:
ease of administration, scoring, and interpretation. Therefore, the developers designed several of the
subtests and also adapted several other commonly used tests to be included in the EAS battery. The
sole purpose was to create a cognitive test battery that could be administered to diverse populations
and be used for a wide range of jobs.

PSI defined each of the targeted abilities as follows.

Verbal Comprehension is the ability to use words in thinking and communicating.

Number is the ability to handle numbers and to work with numerical material, including the ability to
perceive small details accurately and rapidly within materials.

Word Fluency is the ability to produce words rapidly.

Reasoning is the ability to discover relationships and to derive principles.

Mechanical Reasoning is the ability to apply and understand physical and mechanical principles.

Space Visualization is the ability to visualize objects in three-dimensional space.

Syntactic Evaluation is the ability to apply principles to arrive at a unique solution.

Pursuit is the ability to make rapid, accurate scanning movements with the eyes.

Table 13. Abilities measured by each of the EAS subtests.

Subtest                            Abilities measured
EAS 1-Verbal Comprehension         Verbal Comprehension
EAS 2-Numerical Ability            Number; Syntactic Evaluation
EAS 3-Visual Pursuit               Pursuit
EAS 4-Visual Speed and Accuracy    Number
EAS 5-Space Visualization          Space Visualization; Mechanical Reasoning
EAS 6-Numerical Reasoning          Number; Reasoning; Syntactic Evaluation
EAS 7-Verbal Reasoning             Verbal Comprehension; Reasoning
EAS 8-Word Fluency                 Word Fluency
EAS 9-Manual Speed and Accuracy    (none of the eight listed abilities)
EAS 10-Symbolic Reasoning          Syntactic Evaluation

Item Content

The EAS utilizes 10 subtests and is intended to measure eight abilities that are important for a wide
range of occupations. Examples of these occupations are professional/managerial/supervisory,
clerical, production/mechanical (skilled and semi-skilled), technical, sales, unskilled, protective
services, and health professionals. To accommodate the wide range of occupations, item content was
generally neutral with respect to specific job content. The technical manual does not specifically state
the reading level required to understand the item content. However, it is noted that the EAS is
intended for use with individuals of at least 16 years of age, and for occupations that are semi-skilled
and possibly even unskilled. Test items are multiple choice, but the number of options varies
depending on the subtest and intended ability being measured.

Sample Items

The EAS is comprised of 10 subtests:

A. EAS 1-Verbal Comprehension


a. Assesses the ability to understand written words and the ideas associated with them.
i. In this 30-item vocabulary test, the examinee must select the synonym for a
designated word from the four possibilities presented.

B. EAS 2-Numerical Ability


a. Assesses the ability to add, subtract, multiply, and divide integers, decimals, and
fractions.
i. This test is designed to measure ability in addition, subtraction, multiplication,
and division of whole numbers, decimals, and fractions. The examinee is to
add, subtract, multiply, or divide to solve the problem and select a response
from the five alternatives provided: four numerical alternatives and an "X" to
indicate that the correct answer is not given. Integers, decimal fractions, and
common fractions are included in separate tests that are separately timed.

C. EAS 3-Visual Pursuit


a. Measures a person's ability to make rapid, accurate scanning movements with the
eyes.
i. Consists of 30 items. The examinee is to visually trace designated lines
through an entangled network resembling a schematic diagram. The answer,
indicating the endpoint of the line, is selected from five alternatives.

D. EAS 4-Visual Speed and Accuracy
a. Measures the ability to compare numbers or patterns quickly and accurately.
i. The 150 items consist of pairs of number series that may include decimals,
letters, or other symbols. The examinee has 5 minutes to review as many
pairs as possible, indicating for each pair whether they are the same or
different.

E. EAS 5-Space Visualization


a. Assesses the ability to imagine objects in three-dimensional space and to manipulate
objects mentally.
i. 50-item test consisting of pictures of piles of blocks. The examinee indicates
for a specific block how many other blocks in the pile it touches. The decision
to use the familiar blocks format was based on its known predictive validity
for a wide variety of mechanical tasks, ranging from those of the design
engineer to those of the package wrapper in a department store.

F. EAS 6-Numerical Reasoning
a. Measures a participant's ability to analyze logical numerical relationships and to
discover underlying principles.
i. Twenty number series are included. The examinee selects the next number
in the series from five alternatives.

G. EAS 7-Verbal Reasoning


a. Measures the ability to combine separate pieces of information and to form
conclusions on the basis of that information.
i. A series of facts is presented for the examinee to review. Five conclusions
follow each series. The examinee is to indicate whether, based on the factual
information given, the conclusion is true, false, or uncertain.

H. EAS 8-Word Fluency
a. Assesses an individual's ability to generate a number of words quickly without regard
to meaning.
i. Designed to measure flexibility and fluency with words. The examinee writes
as many words as possible beginning with a designated letter.

I. EAS 9-Manual Speed and Accuracy


a. Measures the ability to make repetitive, fine finger movements rapidly and accurately.
i. This test was designed to evaluate the ability to make fine finger movements
rapidly and accurately. The examinee is to place pencil marks within as many
"O"s as possible in 5 minutes.

J. EAS 10-Symbolic Reasoning


a. Assesses the ability to apply general rules to specific problems and to come up with
logical answers.
i. Each of the 30 problems in this test contains a statement and a conclusion.
The examinee marks "T" to indicate the conclusion is true, "F" to indicate it is
false, or "?" to indicate that it is impossible to determine if the conclusion is
true or false based on the information given in the statement. Each statement
describes the relationship between three variables: A, B, and C, in terms of
arithmetic symbols such as =, <, >, ≠, ≮, and ≯. Based on the relationship
described, the examinee evaluates the conclusion about the relationship
between A and C.
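The logic of an EAS 10-style item can be illustrated with a small sketch that enumerates candidate values for A, B, and C. This is a stand-in for the examinee's deduction, not PSI's materials; the relations and value range are invented for the example.

```python
# Illustrative sketch of EAS 10-style item logic: given relations between
# A and B and between B and C, decide whether a conclusion about A and C
# is true ("T"), false ("F"), or indeterminate ("?"). Enumeration over a
# small range of candidate values stands in for symbolic deduction.
from itertools import product

def evaluate(rel_ab, rel_bc, conclusion_ac):
    """Return 'T', 'F', or '?' for a conclusion such as A > C."""
    ops = {">": lambda x, y: x > y,
           "<": lambda x, y: x < y,
           "=": lambda x, y: x == y}
    outcomes = set()
    for a, b, c in product(range(1, 6), repeat=3):
        if ops[rel_ab](a, b) and ops[rel_bc](b, c):
            outcomes.add(ops[conclusion_ac](a, c))
    if outcomes == {True}:
        return "T"
    if outcomes == {False}:
        return "F"
    return "?"
```

For example, A > B with B > C forces A > C ("T"), whereas A > B with B < C leaves the relation between A and C undetermined ("?").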

Construct Validity Evidence

To assess construct validity evidence for the EAS, the test developers factor analyzed EAS scores,
excluding EAS 9-Manual Speed and Accuracy, using the principal factors method. Test reliabilities
were used to estimate communalities. Eight factors were retained and rotated to simple structure
using a varimax rotation. The rotated factor loadings are shown in Table 14, using the factor labels
provided by PSI. PSI described the eight factors with the labels shown in Table 14 based primarily on
the profile of factor loadings of .40 or greater. However, the small sample size (N = 90) greatly limits
the meaningfulness of these results.

Table 14. EAS factor loadings.

Factor:                        Verbal   Number   Word     Reason-  Space    Mech.    Syntactic  Pursuit
Subtest                        Comp.             Fluency  ing      Visual.  Reason.  Eval.
EAS 1-Verbal Comprehension      .82      .11      .20      .25      .19     -.01      .17        .03
EAS 2-Numerical Ability         .32      .47      .25      .16      .24      .09      .57        .26
EAS 3-Visual Pursuit            .04      .10      .03      .17      .26      .26      .17        .79
EAS 4-Visual Speed and Acc.     .21      .72      .35      .09      .03      .11      .06        .33
EAS 5-Space Visualization       .08      .16     -.11      .27      .66      .40      .11        .23
EAS 6-Numerical Reasoning       .23      .33      .10      .57      .41      .12      .33        .03
EAS 7-Verbal Reasoning          .46      .14      .25      .63      .20      .10      .26        .10
EAS 8-Word Fluency              .15      .22      .77      .12     -.05      .01      .18        .01
EAS 10-Symbolic Reasoning       .25      .03      .18      .30      .24      .25      .63        .18
*EAS 9 was purposely excluded from the factor analysis.

The EAS has been correlated with several other tests and performance measures. Such correlations
help define the constructs/abilities the test purports to measure and provide a sense of familiarity for
users with past experience of similar tests. The correlations indicate what features the EAS and other
tests have in common: EAS subtests intended to measure the same constructs as other tests should
correlate highly with those tests.

The EAS was correlated with the Cooperative School and College Ability Tests (SCAT; 1955). Scores
were obtained from 400 junior college students. The SCAT is divided into two parts: verbal and
quantitative. The verbal part measures a student's understanding of words, and the quantitative part
measures an understanding of fundamental number operations. Table 15 reports the correlations
found. As expected, verbal-related EAS subtests correlated with the verbal part of the SCAT, and
quantitative-related subtests correlated with the quantitative part of the SCAT.

Table 15. Correlations between EAS subtests and SCAT.

EAS Test                           SCAT-V   SCAT-Q
EAS 1-Verbal Comprehension          .75      .44
EAS 2-Numerical Ability             .10      .31
EAS 3-Visual Pursuit                .18      .34
EAS 4-Visual Speed and Accuracy     .02      .07
EAS 5-Space Visualization           .17      .38
EAS 6-Numerical Reasoning           .33      .59
EAS 7-Verbal Reasoning              .51      .53
EAS 8-Word Fluency                  .17      .19
EAS 10-Symbolic Reasoning           .01      .06
EAS 1-Verbal Comprehension          .31      .41

One study correlated nine of the EAS subtests with five of the PMA subtests (Thurstone & Thurstone,
1947) in a sample of 90 high school students. The PMA was primarily designed for use with high
school students and includes five subtests. Although high correlations were expected only between
EAS and PMA tests intended to measure the same ability, moderate to high correlations also
appeared where they were not expected. In particular, correlations between EAS 2 and PMA-Verbal,
PMA-Reasoning, and PMA-Word Fluency were much stronger than originally hypothesized. Table 16
reports the correlations between the EAS subtests and the PMA subtests observed in this study.

Table 16. Correlations between EAS and PMA tests.


PMA Tests
EAS Tests Verbal Space Reasoning Number Word Fluency
EAS 1 .85 .19 .53 .28 .54
EAS 2 .62 .39 .59 .51 .50
EAS 3 .23 .53 .44 .14 .23
EAS 4 .52 .30 .50 .64 .45
EAS 5 .29 .58 .46 .17 .20
EAS 6 .52 .40 .68 .46 .44
EAS 7 .64 .34 .74 .35 .56
EAS 8 .45 .11 .41 .28 .64
EAS 10 .47 .46 .52 .20 .40

PMA subtest definitions:

A. Verbal
a. A vocabulary test similar in format and content to EAS 1.
B. Space

a. Measures the ability to perceive spatial relationships by manipulating objects in a
three-dimensional space.
C. Reasoning
a. Assesses the ability to discover and apply principles.
D. Number
a. Measures one’s ability to add numbers.
E. Word fluency
a. Similar to EAS 8 which measures flexibility and fluency with words.

The EAS subtests, except for EAS 9, were also correlated with the California Test of Mental Maturity
(CTMM; Sullivan, Clark, & Tiegs, 1936). The same sample of 90 high school students was used for
this study. Like the EAS, the CTMM encompasses a wide variety of item types, such as spatial
relations, computation, number series, analogies, similarities, opposites, immediate and delayed
recall, inference, and vocabulary. Table 17 below shows the resulting correlations, which were at
least moderately high.

Table 17. Correlations between the EAS subtests and CTMM.


CTMM – Total Mental Factors IQ
EAS Subtest Sample 1 Sample 2 Sample 3
EAS 1 .72 .75 .83
EAS 2 .70
EAS 3 .31
EAS 4 .44
EAS 5 .43
EAS 6 .66 .70
EAS 7 .67
EAS 8 .40
EAS 10 .63
*Sample 1: 90 male high school students; Sample 2: 103 management selection examinees in an
aircraft manufacturing facility; Sample 3: 148 prisoners.

Five of the EAS subtests (EAS 1,2,5,6, and 7) were correlated with the Bennett Mechanical
Comprehension Test (BMCT; Bennett & Fry, 1941), as shown in Table 18. The BMCT measures the
ability to derive, understand, and apply physical and mechanical principles. The moderate
correlations for each of the five EAS subtests suggest that a general reasoning ability contributes to
performance on each of the tests.

Table 18. Correlations between EAS subtests and BMCT.


EAS Test BMCT
EAS 1-Verbal Comprehension .37
EAS 2-Numerical Ability .53
EAS 5-Space Visualization .43
EAS 6-Numerical Reasoning .51
EAS 7-Verbal Reasoning .31
*Sample was 260 applicants for a wide variety of jobs.

Besides the BMCT, the EAS was also correlated with the Otis Employment Test (Otis, 1943), which
measures cognitive abilities. Item types include vocabulary, reasoning, syllogisms, arithmetic
computation and reasoning, proverbs, analogies, spatial relations, and number series. It was
expected that the Otis would correlate modestly with the same five subtests that were correlated with
the BMCT. As expected, the EAS subtests correlate moderately with the Otis, as shown in Table 19.
The EAS 4-Visual Speed and Accuracy subtest was correlated with scores from both the Minnesota
Clerical Test (MCT) and the Differential Aptitude Test (DAT) Clerical Speed and Accuracy subtest.
Each of these tests was designed to assess the ability to perceive details rapidly and accurately; the
MCT was originally the basis for the development of the EAS 4. Table 20 shows a high correlation
with both. A sample of 89 applicants for a wide variety of jobs was used for the correlation between
the EAS 4 and the MCT, and a sample of 100 inmates was used for the correlation between the
EAS 4 and the DAT Clerical Speed and Accuracy.

Table 19. Correlations between the EAS and the Otis.


EAS Test Otis
EAS 1-Verbal Comprehension .55
EAS 2-Numerical Ability .51
EAS 5-Space Visualization .47
EAS 6-Numerical Reasoning .60
EAS 7-Verbal Reasoning .56
*Sample was 220 applicants for a wide variety of jobs.

Table 20. Correlations between EAS 4, MCT, and the DAT.

EAS Subtest                        MCT – Total Score   DAT – Clerical Speed and Accuracy
EAS 4-Visual Speed and Accuracy          .82                        .65

The EAS 7-Verbal Reasoning subtest was also correlated with the Watson-Glaser Critical Thinking
Appraisal (W-GCTA; Watson & Glaser, 1942). The W-GCTA assesses five facets of critical thinking
as defined by Watson and Glaser, one being the ability to form logical conclusions from various facts.
While a moderate correlation was expected, the observed correlations were moderately low in two of
the three samples. Table 21 shows the relevant results.

Table 21. Correlations between EAS 7 and the W-GCTA.

EAS Test                 Sample 1   Sample 2   Sample 3
EAS 7-Verbal Reasoning     .45        .26        .59

PSI also computed average intercorrelations between each pair of the EAS subtests. Although each
subtest was designed to measure a specific ability, several subtests tap more than one ability. Table
22 presents the average intercorrelations among the subtests. The findings show that several pairs of
subtests share common ability components:

A. EAS 1 and EAS 7 are related because both emphasize the ability to understand words and the
concepts associated with them.

B. EAS 2 and EAS 4 both depend on the ability to work quickly and accurately with numbers.

C. EAS 2 and EAS 6 are related because both rely on an individual's ability to interpret numerical
materials.

D. EAS 3 and EAS 5 share a perceptual component.

E. EAS 6, EAS 7, and EAS 10 are related because all three are influenced by the ability to derive
and apply rules and principles to solve problems.

Table 22. Average intercorrelations of EAS subtests.
Average Intercorrelations
Test EAS 1 EAS 2 EAS 3 EAS 4 EAS 5 EAS 6 EAS 7 EAS 8 EAS 9 EAS 10
EAS 1
EAS 2 .26
EAS 3 .08 .20
EAS 4 .10 .41 .30
EAS 5 .22 .28 .40 .34
EAS 6 .26 .43 .18 .19 .35
EAS 7 .40 .34 .15 .16 .30 .37
EAS 8 .27 .31 .09 .23 .14 .18 .19
EAS 9 .03 .12 .23 .26 .24 .05 .01 .16
EAS 10 .27 .33 .17 .22 .29 .38 .37 .14 .10

Item and Test Development

Item Development

Originally, the EAS was designed with 15 short-time-limit subtests. These subtests were administered
to 273 employees of a medium-sized factory. Results led to decisions to remove some subtests from
the battery and to change the format, length, and instructions of others. Each of the subtests in the
EAS battery was developed separately by different test developers or was adapted from previous
tests.

EAS subtest item content, specifications, and development:

A. EAS 1-Verbal Comprehension


a. A large pool of items was given to several groups, ranging from college students and
factory workers to prisoners. Item difficulty was determined for each item, and item
score vs. total score phi coefficients were computed. Two alternate forms meeting
statistical criteria of comparability (mean, standard deviation, and homogeneity) were
created from the larger pool of items.

B. EAS 2-Numerical Ability


a. The examinee must add, subtract, multiply, or divide to solve each problem and
select a response from the five alternatives provided: four numerical alternatives and
an "X" to indicate that the correct answer is not given. Integers, decimal fractions,
and common fractions are included in separately timed parts. Part 1, with a time limit
of 2 minutes, measures facility with whole numbers; Part 2, which requires 4 minutes,
measures facility with decimals; and Part 3, also a 4-minute test, measures facility
with fractions. When desired, this test can function as a battery of three tests. Two
equivalent forms are available.

C. EAS 3-Visual Pursuit


a. The original prototype of this test was the Pursuit subtest of the MacQuarrie Tests for
Mechanical Ability (MacQuarrie, 1925). Modifications included (1) an increased time
limit, (2) adaptation to machine scoring, (3) easier initial items, and (4) a redesigned
answer scoring procedure. For face validity, the symbols and format of electrical
wiring diagrams and electronic schematics were used. Two equivalent forms were
created.

D. EAS 4-Visual Speed and Accuracy


a. The design of this test was based upon the Minnesota Clerical Test (MCT) by
Andrew, Paterson, and Longstaff (1933). The test length was shortened to five
minutes. It was decided that only number series would be used for the items, and
these number series would include decimals, letters, and other symbols. The
characters in each item were selected using a random numbers table.

E. EAS 5-Space Visualization


a. This test is based on the MacQuarrie Test for Mechanical Ability (MacQuarrie, 1925),
which has a long history of successful use in the United States. The EAS subtest is
twice the length of the MacQuarrie to gain additional reliability. The test developer
also changed the directions for clarity, adapted the test to machine scoring, and
arranged the items in order of increasing difficulty.

F. EAS 6-Numerical Reasoning


a. This test is based on Test 6 of the Army Group Examination Alpha of World War I
(1918). For the EAS 6, multiple choice items were used instead of open-ended
items. The items selected for inclusion were derived from a study by Lovell (1944):
20 items were used in Form A after being analyzed in an industrial population. Form
B was created by generating a large pool of similar items and selecting 20 items of
comparable difficulty and homogeneity.

G. EAS 7-Verbal Reasoning


a. This test is based on Subtest 15 of the California Test of Mental Maturity (CTMM) by
Sullivan, Clark, and Tiegs (1936). Form A represents the surviving items of a larger
pool. Form B is composed of items of the same logical form that differ from Form A
items only in content.

H. EAS 8-Word Fluency


a. This test is an adaptation of the SRA Primary Mental Abilities Test (PMA) by
Thurstone and Thurstone (1947). Multiple forms exist for this test.

I. EAS 9-Manual Speed and Accuracy


a. This test is based on the Dotting subtest of the MacQuarrie Tests for Mechanical
Ability (MacQuarrie, 1925). In order to increase reliability and face validity, the length
of the test was increased.

J. EAS 10-Symbolic Reasoning


a. This test is based on the work of authors who developed a test to measure the ability
to evaluate symbolic relations (Wilson, Guilford, Christensen, & Lewis, 1954).
Another idea for this test was derived from EAS 7's use of an "uncertain" response
category. Verbal instructions were removed from this test to reduce loadings on the
verbal factor. The items of the two forms are arranged in order of increasing
difficulty.
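The classical item statistics mentioned above for EAS 1 development (per-item difficulty and item score vs. total score phi coefficients) can be sketched as follows. The response vectors used in testing this sketch are invented; this is not PSI's analysis code.

```python
# Hedged sketch of classical item analysis: item difficulty is the
# proportion of examinees answering the item correctly, and phi is the
# correlation between two dichotomous (0/1) variables, e.g., item
# right/wrong vs. a pass/fail split on total score.
import math

def item_difficulty(item_responses):
    """Proportion correct for one item across examinees (0/1 responses)."""
    return sum(item_responses) / len(item_responses)

def phi(x, y):
    """Phi coefficient between two 0/1 vectors of equal length."""
    n = len(x)
    n11 = sum(1 for a, b in zip(x, y) if a and b)  # both 1
    n1_ = sum(x)  # row marginal
    n_1 = sum(y)  # column marginal
    num = n * n11 - n1_ * n_1
    den = math.sqrt(n1_ * (n - n1_) * n_1 * (n - n_1))
    return num / den if den else 0.0
```

Items with moderate difficulty and high item-total phi would be the ones retained when assembling comparable alternate forms.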

EAS test scores have been correlated with scores from a wide variety of tests designed to assess
several distinct abilities. Table 23 summarizes the relationships between the EAS tests and the kinds
of abilities measured by these tests. The pattern of correlations was as expected: (a) EAS 1 and
EAS 7 correlate very highly with other verbal ability tests, (b) EAS 2 and EAS 6 correlate highly with
other numerical ability tests, and (c) EAS 4 has a strong relationship with other clerical measures. All
of the EAS tests with available data correlate at least moderately with measures of general mental
ability.

Table 23. Average correlations of the EAS subtests with abilities assessed by other batteries.

                                              Type of Ability
Subtest   Rel.*    M      SD     Verbal   Numer.   Verbal   Reason.  Space  Mech.  Clerical  Gen. Mental
                                                   Fluency                                   Ability
EAS 1      .85    19.2    6.38   .77 (2)  .41 (2)   .54      .53      .19    .37             .71 (4)
EAS 2      .87    45.9   13.70   .21 (2)  .35 (2)   .50      .59      .39    .53             .57 (2)
EAS 3      .86    19.0    4.68   .19 (2)  .31 (2)   .23      .44      .53                    .31
EAS 4      .91    93.1   21.80   .12 (2)  .19 (2)   .45      .50      .30           .74 (2)  .44
EAS 5      .89    27.7   10.20   .19 (2)  .34 (2)   .20      .46      .58    .43             .46 (2)
EAS 6      .81    10.8    4.27   .37 (2)  .57 (2)   .44      .68      .40    .51             .64 (3)
EAS 7      .82    14.3    6.19   .54 (2)  .50 (2)   .56      .74      .34    .31             .52 (5)
EAS 8      .76    46.2   12.20   .22 (2)  .21 (2)   .64      .41      .11                    .40
EAS 9      .75   408.0   89.00   .01      .06
EAS 10     .82    11.6    6.61   .34 (2)  .37 (2)   .40      .52      .40                    .63

(The number in parentheses is the number of studies on which the correlation is based. If no
number is given, the correlation reported is based on one study.)
*For 9 of the 10 tests (EAS 1 through EAS 8 and EAS 10), reliability was estimated using an
alternate-form method. For EAS 9, a test-retest reliability estimate is reported with 2 to 14 days
between administrations.

Correlations between the EAS subtests are reported in Table 24. The relatively low intercorrelations
support the value of the EAS as a battery. The developers computed intercorrelations between each
pair of EAS subtests in several educational and occupational groups, since the battery was to be
used in a wide variety of situations. The correlations for each group were transformed to Fisher Z
indices, averaged, and the average Fisher Z was transformed back to a correlation.
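The Fisher Z averaging procedure just described can be expressed in a few lines. This is a generic sketch of the standard transformation, not PSI's code.

```python
# Fisher r-to-z averaging: each correlation is transformed to Fisher z
# (atanh), the z values are averaged, and the mean z is transformed back
# to a correlation (tanh).
import math

def fisher_average(correlations):
    """Average correlations via the Fisher r-to-z transformation."""
    zs = [math.atanh(r) for r in correlations]
    return math.tanh(sum(zs) / len(zs))

# e.g., averaging the three EAS 7 / W-GCTA correlations reported in Table 21:
avg_r = fisher_average([.45, .26, .59])
```

Averaging in the z metric avoids the downward bias that comes from averaging correlations directly, which matters when pooling results across groups of different sizes.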

Table 24. Average intercorrelations between EAS subtests.
Tests EAS 1 EAS 2 EAS 3 EAS 4 EAS 5 EAS 6 EAS 7 EAS 8 EAS 9 EAS 10
EAS 1
EAS 2 .26
EAS 3 .08 .20
EAS 4 .10 .41 .30
EAS 5 .22 .28 .40 .34
EAS 6 .26 .43 .18 .19 .35
EAS 7 .40 .34 .15 .16 .30 .37
EAS 8 .27 .31 .09 .23 .14 .18 .19
EAS 9 .03 .12 .23 .26 .24 .05 .01 .16
EAS 10 .27 .33 .17 .22 .29 .38 .37 .14 .10

The EAS 2-Numerical Ability subtest is unusual in that it consists of three separately timed parts
specific to integers, decimals, and fractions. Correlations between Part 2 (decimals) and Part 3
(fractions) were stable across three samples of job incumbents, but correlations between Part 1
(integers) and either Part 2 or Part 3 were more variable across job samples. The latter correlations
were lowest in groups that had not been actively engaged in arithmetic computation. Table 25 shows
these part-score correlations for the three job samples.

Table 25. Part-score intercorrelations for three parts of the EAS 2.

Parts Correlated                    Electronic Trainees   Graduate Engineers   Telephone Operators
                                    (N = 167)             (N = 205)            (N = 192)
I (Integers) and II (Decimals)      .58                   .53                  .31
I (Integers) and III (Fractions)    .62                   .45                  .35
II (Decimals) and III (Fractions)   .67                   .62                  .63

Psychometric Model

There is no indication that PSI applied IRT-based item analyses to EAS items during the original development process. However, the PSI I-O psychologist indicated that a pool of EAS items, with IRT estimates, has been accumulating as additional transparent forms have been developed over the years since the EAS was implemented. This accumulating “bank” of EAS items serves to facilitate the development of the new forms of the EAS that PSI occasionally introduces in a transparent manner.

Multiple Forms

There are two forms of each subtest of the EAS, except for EAS 9-Manual Speed and Accuracy. To evaluate the equivalency of the alternate forms, the developers administered both forms of each test to 330 junior college students. Half of the students were given Form A first, and the other half were given Form B first. The analysis revealed no statistically significant differences between the means or standard deviations of any pair of forms.

After speaking with the publisher, we have also learned that there is a third form of the EAS. This third form is meant specifically for online unproctored testing using PSI’s ATLAS™ web-based talent assessment management system.
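The equivalence analysis described above amounts to comparing the mean scores earned on the two forms across counterbalanced groups. A minimal sketch of the core comparison is a pooled-variance two-sample t statistic; the score vectors below are hypothetical, and the developers' exact analysis procedure is not documented in the source material:

```python
import math
import statistics

def independent_t(sample_a, sample_b):
    """Pooled-variance two-sample t statistic for comparing mean scores
    earned on two alternate test forms."""
    na, nb = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.fmean(sample_a), statistics.fmean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    pooled = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    return (mean_a - mean_b) / math.sqrt(pooled * (1 / na + 1 / nb))

# Hypothetical raw scores for examinees taking Form A vs. Form B
form_a = [22, 25, 19, 28, 24, 21]
form_b = [23, 24, 20, 27, 25, 22]
t = independent_t(form_a, form_b)  # near zero if the forms are equivalent
```

A t value near zero (relative to its critical value for the given degrees of freedom) is consistent with the forms being equivalent in difficulty.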
Item Banking Approaches

As noted above, PSI maintains an accumulating bank of EAS items. But the only purpose that bank appears to serve is to facilitate the occasional development of new, transparent forms of the EAS.

44
Approach to Item / Test Security

Test Security

For all subtests except EAS 9, there are two forms. For online unproctored administration, PSI offers ATLAS™, an enterprise web-based Software-as-a-Service (SaaS) platform for managing selection and assessment processes. The platform supports a number of functions (e.g., configuring tests, batteries, and score reports; managing test inventory; managing and deploying proctored and unproctored tests; and establishing candidate workflow with or without an applicant tracking system). The ability to administer the EAS unproctored is a very desirable feature of ATLAS™. However, no information is available about the manner in which PSI handles unproctored test security. The publisher did indicate to us that online unproctored use is available only internationally, not within the U.S. To support that unproctored option outside the U.S., PSI provides a third form developed specifically for international clients who wish to employ online unproctored testing. However, PSI appears to take no measures to monitor, minimize, or respond to indications of cheating or piracy.

Criterion Validity Evidence

Criterion Validity

PSI conducted meta-analyses of the observed validities for each test-criterion-occupation combination that contained five or more criterion validity studies. Forty-nine meta-analyses were conducted: 24 for predicting job performance and 25 for predicting training success. For test-criterion-occupation combinations containing fewer than five studies, average observed and corrected validities were computed. Table 26 shows the validity coefficient for each occupational category and the corresponding subtest, along with sample size and the type of criterion measure used.

For the meta-analysis, PSI located 160 criterion validity studies: studies from the 1963 technical report, unpublished validation studies from employers and external consultants, and unpublished literature. Information coded for the meta-analysis included (a) job category; (b) criterion category, job performance or training success; (c) the type of criterion measure; (d) the EAS tests used as predictors; (e) sample size; and (f) the value of the observed validity coefficient. The validation studies covered a broad range of jobs, which were grouped into the eight occupational families described previously.

Validity coefficients were differentiated on the basis of the criterion used in the study, either job performance or training success. Four types of criterion measures were identified: (a) performance, including supervisors' rankings of job proficiency, instructor ratings, and course grades; (b) production, including production data and scores on work sample tests; (c) job knowledge, referring to scores on job knowledge tests; and (d) personnel actions, such as hiring and terminating employees.

The results of the meta-analyses indicate that 48 of the 49 credibility values were above 0; therefore, 98% of the test-criterion-occupation combinations showed generalizability across jobs and organizational settings within major job categories. Results also indicated that certain EAS subtests were better predictors of job performance and training success than others for various occupational families. For example, under the occupational grouping of Professional, Managerial, and Supervisory, EAS 2 was a better predictor of job performance than EAS 1, EAS 7 was a better predictor than EAS 6, and EAS 2 was a better predictor than EAS 7. Table 26 provides the resulting validity coefficients for the EAS subtests by criterion and occupational grouping.
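A bare-bones version of the aggregation underlying such a meta-analysis — a sample-size-weighted mean of observed validities, followed by the classical correction for criterion unreliability — can be sketched as follows. This is a simplified illustration only; PSI's actual procedures may have corrected for additional artifacts (e.g., range restriction), and the study values used here are hypothetical:

```python
import math

def n_weighted_mean_r(studies):
    """Sample-size-weighted mean observed validity across studies,
    where `studies` is a list of (sample_size, observed_r) pairs."""
    total_n = sum(n for n, _ in studies)
    return sum(n * r for n, r in studies) / total_n

def correct_for_criterion_unreliability(r_observed, criterion_reliability):
    """Classical attenuation correction: the observed validity divided by
    the square root of the criterion reliability."""
    return r_observed / math.sqrt(criterion_reliability)

# Hypothetical studies: (sample size, observed validity)
studies = [(100, 0.20), (300, 0.40)]
mean_r = n_weighted_mean_r(studies)                      # 0.35
rho = correct_for_criterion_unreliability(mean_r, 0.64)  # 0.35 / 0.8
```

The correction illustrates why the corrected coefficients in Table 26 are uniformly larger than the observed ones: dividing by a square-rooted reliability below 1.0 always inflates the estimate.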

45
Table 26. Meta-analysis results for PSI criterion validities for EAS subtests within job families.

Professional, Managerial, & Supervisory
Test                            Criterion Category   Type of Criterion    N     Observed r   Corrected r
1 - Verbal Comprehension        Job Performance      Personnel Actions    205   .14          .30
2 - Numerical Ability           Job Performance      Personnel Actions    130   .49          .89
3 - Visual Pursuit              Job Performance      Performance          150   .06          .13
5 - Space Visualization         Job Performance      Performance          428   .13          .28
                                Job Performance      Production           142   .20          .42
6 - Numerical Reasoning         Job Performance      Personnel Actions    130   .27          .56
7 - Verbal Reasoning            Job Performance      Personnel Actions    130   .43          .81
8 - Word Fluency                Job Performance      Performance          250   .14          .31
                                Job Performance      Personnel Actions    107   .01          .02
10 - Symbolic Reasoning         Job Performance      Performance          100   .31          .63
                                Job Performance      Personnel Actions    128   .26          .54

Clerical
1 - Verbal Comprehension        Job Performance      Production           95    .21          .45
                                Training Success     Job Knowledge        33    .26          .46
2 - Numerical Ability           Job Performance      Production           95    .22          .46
                                Training Success     Job Knowledge        33    .35          .60
3 - Visual Pursuit              Job Performance      Performance          81    .31          .63
4 - Visual Speed and Accuracy   Job Performance      Production           95    .42          .80
                                Training Success     Job Knowledge        33    .09          .17
6 - Numerical Reasoning         Job Performance      Production           95    .24          .50
                                Training Success     Job Knowledge        33    .17          .31
7 - Verbal Reasoning            Job Performance      Production           95    .24          .49
                                Training Success     Performance          63    .63          .90
                                Training Success     Job Knowledge        33    .19          .29
8 - Word Fluency                Job Performance      Performance          108   .35          .68
9 - Manual Speed and Accuracy   Job Performance      Production           96    .24          .50
10 - Symbolic Reasoning         Job Performance      Performance          108   .21          .44

Production/Mechanical (Skilled & Semi-skilled)
1 - Verbal Comprehension        Job Performance      Production           69    .03          .07
                                Job Performance      Job Knowledge        69    .20          .42
2 - Numerical Ability           Job Performance      Production           39    .24          .43
                                Training Success     Job Knowledge        39    .46          .73
3 - Visual Pursuit              Job Performance      Production           69    .15          .32
                                Job Performance      Job Knowledge        296   .41          .79
                                Training Success     Performance          78    .16          .29
4 - Visual Speed and Accuracy   Job Performance      Production           69    .09          .20
                                Job Performance      Job Knowledge        69    .32          .64
                                Job Performance      Personnel Actions    136   .00          .00
5 - Space Visualization         Job Performance      Production           69    .29          .59
                                Job Performance      Job Knowledge        131   .43          .81
                                Training Success     Production           40    .26          .46
                                Training Success     Job Knowledge        40    .36          .61
6 - Numerical Reasoning         Job Performance      Performance          114   .22          .46
                                Job Performance      Production           69    .23          .48
                                Job Performance      Job Knowledge        69    .43          .81
7 - Verbal Reasoning            Job Performance      Production           69    .15          .32
                                Training Success     Job Knowledge        131   .42          .80
                                Training Success     Performance          104   .28          .49
8 - Word Fluency                Training Success     Performance          78    .17          .31
9 - Manual Speed and Accuracy   Training Success     Performance          78    -.02         -.05
10 - Symbolic Reasoning         Job Performance      Performance          157   .19          .40
                                Training Success     Performance          78    .25          .44

Technical
1 - Verbal Comprehension        Training Success     Job Knowledge        471   .60          .88
2 - Numerical Ability           Training Success     Job Knowledge        471   .58          .85
3 - Visual Pursuit              Job Performance      Performance          99    .06          .14
                                Training Success     Job Knowledge        471   .39          .65
4 - Visual Speed and Accuracy   Training Success     Job Knowledge        471   .37          .63
5 - Space Visualization         Training Success     Job Knowledge        471   .45          .72
6 - Numerical Reasoning         Job Performance      Performance          143   .21          .44
                                Training Success     Job Knowledge        471   .46          .73
7 - Verbal Reasoning            Training Success     Job Knowledge        471   .48          .76
8 - Word Fluency                Job Performance      Performance          9     .42          .80
                                Training Success     Job Knowledge        471   .36          .61
9 - Manual Speed and Accuracy   Job Performance      Performance          231   .04          .08
                                Training Success     Job Knowledge        381   .36          .61
10 - Symbolic Reasoning         Job Performance      Performance          53    .43          .81
                                Training Success     Job Knowledge        471   .64          .91

Sales
1 - Verbal Comprehension        Job Performance      Performance          107   .40          .77
                                Training Success     Job Knowledge        140   .25          .45
2 - Numerical Ability           Job Performance      Performance          107   .41          .79
                                Training Success     Job Knowledge        140   .50          .78
4 - Visual Speed and Accuracy   Job Performance      Performance          107   .29          .58
                                Training Success     Job Knowledge        140   .09          .17
5 - Space Visualization         Job Performance      Performance          19    .70          1.11
6 - Numerical Reasoning         Job Performance      Performance          107   .34          .68
                                Training Success     Job Knowledge        140   .40          .66
7 - Verbal Reasoning            Job Performance      Performance          88    .25          .51
                                Training Success     Job Knowledge        140   .26          .46
8 - Word Fluency                Job Performance      Performance          107   .27          .56
                                Training Success     Job Knowledge        140   .22          .40

Unskilled
1 - Verbal Comprehension        Job Performance      Performance          44    .12          .26
2 - Numerical Ability           Job Performance      Performance          190   .05          .11
4 - Visual Speed and Accuracy   Job Performance      Performance          44    .21          .44
5 - Space Visualization         Job Performance      Performance          186   .10          .22
6 - Numerical Reasoning         Job Performance      Performance          44    .16          .34
7 - Verbal Reasoning            Job Performance      Performance          44    -.01         -.02

Protective Services
1 - Verbal Comprehension        Job Performance      Performance          103   .00          .00
                                Training Success     Performance          150   .35          .60
                                Training Success     Production           132   -.09         -.17
                                Training Success     Job Knowledge        137   .34          .58
2 - Numerical Ability           Job Performance      Performance          104   .01          .02
                                Training Success     Performance          150   .26          .46
                                Training Success     Production           132   -.02         -.04
                                Training Success     Job Knowledge        137   .32          .55
3 - Visual Pursuit              Training Success     Performance          49    .34          .58
                                Training Success     Production           132   .19          .35
                                Training Success     Job Knowledge        137   .16          .29
4 - Visual Speed and Accuracy   Training Success     Performance          49    .27          .48
                                Training Success     Production           132   -.04         -.08
                                Training Success     Job Knowledge        137   .17          .31
5 - Space Visualization         Job Performance      Performance          224   .11          .24
                                Training Success     Performance          277   .30          .52
                                Training Success     Production           259   .21          .38
                                Training Success     Job Knowledge        264   .19          .35
6 - Numerical Reasoning         Job Performance      Performance          105   .20          .42
                                Training Success     Performance          150   .31          .54
                                Training Success     Production           132   .09          .17
                                Training Success     Job Knowledge        137   .35          .60
7 - Verbal Reasoning            Job Performance      Performance          106   -.04         -.09
                                Job Performance      Production           102   .20          .42
                                Training Success     Performance          235   .31          .54
                                Training Success     Production           132   -.16         -.29
                                Training Success     Job Knowledge        137   .39          .65
8 - Word Fluency                Training Success     Performance          48    .35          .60
9 - Manual Speed and Accuracy   Training Success     Performance          49    .34          .58
10 - Symbolic Reasoning         Job Performance      Performance          106   .03          .07
                                Training Success     Performance          134   .40          .67

Health Professional
1 - Verbal Comprehension        Job Performance      Production           118   .52          .93
                                Training Success     Performance          96    .15          .28
                                Training Success     Production           30    .37          .62
2 - Numerical Ability           Training Success     Performance          96    .06          .11
                                Training Success     Production           30    .32          .55
3 - Visual Pursuit              Training Success     Performance          96    -.01         -.02
                                Training Success     Production           29    .32          .55
4 - Visual Speed and Accuracy   Training Success     Performance          96    .04          .08
                                Training Success     Production           30    .40          .66
5 - Space Visualization         Training Success     Performance          96    .19          .35
                                Training Success     Production           30    .22          .40
9 - Manual Speed and Accuracy   Training Success     Performance          96    .26          .46
                                Training Success     Production           30    .13          .24

50
Meta-analytic results were also aggregated within major occupational groupings, shown in Table 27. Statistics were computed only for occupational grouping-criterion combinations where sufficient data were available. A test was included in the battery if its validity was generalizable across jobs and settings. The premise of this analysis was that test battery validity is likely to be greater than the validity of any single predictor: by adding tests, one is either improving the measurement of some cognitive ability or improving the battery's prediction by adding measures of new, independent abilities. Table 27 shows the validity generalization results, in which validity did in fact generalize across jobs and organizational settings. The implication for employers is that, in those instances where EAS validity generalizes, it would not be necessary to conduct a local validation study.
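The rationale above — that adding a relatively independent test raises battery validity more than adding a redundant one — can be made concrete with the standard formula for the multiple correlation of two predictors with a criterion. The validities and intercorrelations used here are hypothetical, chosen only to illustrate the point:

```python
import math

def multiple_r_two_predictors(r1, r2, r12):
    """Multiple correlation of a criterion with two predictors, given each
    predictor's validity (r1, r2) and their intercorrelation (r12)."""
    r_squared = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    return math.sqrt(r_squared)

# The lower the overlap between the two tests, the more the second
# test adds to the battery's validity.
print(round(multiple_r_two_predictors(0.40, 0.35, 0.20), 3))  # low overlap, ~0.486
print(round(multiple_r_two_predictors(0.40, 0.35, 0.70), 3))  # high overlap, ~0.412
```

Both batteries beat the best single predictor (.40), but the weakly correlated pair gains considerably more.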

Table 27. Validity generalization results: Generalized mean true validity for EAS subtests.
Occupational groupings (each with Job Performance and Training Success columns):
Professional, Managerial, and Supervisory; Clerical; Production/Mechanical (Skilled and Semi-skilled); Technical

EAS Test
EAS 1 .52 .42 .37 .35 .62 .16 .48
EAS 2 .55 .60 .46 .38 .69 .53 .75
EAS 3 .26 .35 .40
EAS 4 .28 .46 .32 .31 .30 .39
EAS 5 .33 .49 .37 .48 .37 .46
EAS 6 .63 .33 .29 .34 .63
EAS 7 .67 .29 .46 .13 .33 .47
EAS 8 .39 .21
EAS 9 .27 .27 .28 .29
EAS 10 .47 .59
Note: All validity generalization results are based on performance criteria (e.g., supervisory ratings
and course grades).

PSI also conducted a study to determine which EAS subtests can be combined into composites that are maximally predictive for specific job families. Table 28 shows the validity coefficients for combinations of EAS subtests and related job families. These combinations of subtests are recommended by PSI for assessing an applicant's abilities relating to specific job families.

51
Table 28. Reported validity coefficients for combinations of EAS subtests and job families.

Each tailored battery combines four EAS subtests (three for the General Mental Ability battery) drawn from EAS 1-Verbal Comprehension, EAS 2-Numerical Ability, EAS 3-Visual Pursuit, EAS 4-Visual Speed and Accuracy, EAS 5-Space Visualization, EAS 6-Numerical Reasoning, EAS 7-Verbal Reasoning, and EAS 10-Symbolic Reasoning.

Job Family                                  Validity* of Tailored Battery
Professional, Managerial, & Supervisory     .57
Clerical                                    .42
Production/Mechanical                       .35
Technical                                   .46
IT                                          .41
General Mental Ability (g)**                .31

*Based upon meta-analysis of over 160 studies; validities are N-weighted mean coefficients adjusted for criterion unreliability.
**Includes validity studies from across the occupational spectrum.

A single study by Kolz, McFarland, and Silverman (1998) provided additional criterion validity evidence for the EAS. They examined correlations between EAS ability composites and a measure of job performance for incumbents at different levels of job experience. Overall, their results indicate that the EAS composites predict job performance better as experience increases: the EAS predicted job performance better for incumbents with 10 or more years of job experience than for those with 1-3 years. Results are displayed in Table 29.

Table 29. Correlations between EAS ability composites and job performance at different levels of job
experience.
Job Experience Level in Years
EAS Ability 1-3 4-6 7-9 10+ Total
Composite (N = 33) (N = 54) (N = 39) (N = 50) (N=176)
Mechanical .05 -.07 .36 .27 .18
Arithmetic .07 .18 .35 .42 .25
Logic -.05 .04 .54 .50 .23
1 Each correlation shows the relationship between an EAS Ability Composite and a measure of job performance for incumbents with a certain amount of job experience. For example, the .05 correlation in the Mechanical 1-3 cell indicates that among the 33 incumbents with 1-3 years of job experience, the EAS Mechanical composite correlated .05 with job performance. In contrast, the EAS Mechanical composite correlated .36 with job performance among the 39 incumbents who had 7-9 years of job experience. Overall, these results indicate that the EAS composites predict job performance better as experience increases.

Translations / Adaptions

Translations

The EAS is available in English, Spanish, French, and German. The publisher indicated that they often use a vendor for test translations. Not all 10 of their tests have been translated into other languages, and to date the EAS has not been translated into Arabic. The publisher has found that, on average, about 80% of items retain correct content after cultural and language translation.

User Support Resources

 Technical manual
 Fact sheet
 Descriptive Report
 Note: PSI does not publicly display user support resources.

Evaluative Reviews

Fourteenth Mental Measurements Yearbook

Overall, reviewers have provided generally positive feedback (Engdahl & Muchinsky, 2001). For instance, the reviewers state that the EAS compares favorably with other multifactor ability batteries for use in selection and vocational guidance. They also note that it has been extensively used, as demonstrated by its solid heritage and long record of usefulness, and they comment on the usefulness of the materials presented by the publishers. The only concerns appear to be around validity and the use of the EAS for selection purposes with upper-level jobs.

An older review by Crites (1963) reported that the EAS is based upon a sound rationale and consists of subtests with proven validities. The concerns at that time involved computing correlations with large samples and using job success criteria. Otherwise, Crites judged it a well-developed test that can be used for selection and career guidance.

Siegel (1958) also reviewed the EAS and reported that it is a strong instrument whose tests can be used singly or combined with other tests to create composites. The best features are the simplicity and ease of administration, as well as the scope of coverage of the tests.

DESCRIPTIVE RESULTS FOR GATB


Overall Description and Uses

Introduction

The General Aptitude Test Battery (GATB) was originally developed in 1947 by the United States Employment Service (USES) for use by state employment offices. The intended use was to match job applicants with potential employers in private and public sector organizations. Over time, private sector organizations also came to use it to screen large numbers of applicants for open positions. The US Department of Labor even proposed the use of the GATB as a prescreening and referral instrument for nearly all jobs in the United States.

The GATB originally consisted of 12 separately timed subtests used to measure 9 abilities (i.e., verbal ability, arithmetic reasoning, computation, spatial ability, form perception, clerical perception, motor coordination, manual dexterity, and finger dexterity). Because of concerns raised by a review of the battery by the U.S. National Research Council (NRC) of the National Academy of Sciences in the late 1980s, several changes were made to the GATB. The main concerns related to the overall look of the test and accompanying resources, cultural bias, speededness, scoring, and susceptibility to coaching. These concerns were the basis for developing new forms of the GATB, Forms E and F. Our focus therefore is on GATB Forms E and F, as these were the forms intended for selection purposes, not just the vocational and career guidance purposes of earlier versions.

Development of GATB Forms E and F had the objective of reducing test speededness, thereby making the battery less susceptible to coaching. This was done by reducing the number of items and increasing the time limits for a few of the subtests. The developers also incorporated more appropriate scoring procedures, developed better instructions for the examinee, developed test items free from bias, assembled parallel test forms, improved the overall look of the test and accompanying resources, and revised the answer sheets and other supporting documents to be consistent with changes to the test format. During revision, the Form Matching subtest was dropped from the battery, leaving 11 subtests, as shown in Table 30.

Forms E and F were repurposed as the O*Net Ability Profiler, which is used exclusively for vocational counseling, occupational exploration, and career planning. The O*Net Ability Profiler is offered through the U.S. Department of Labor, Employment and Training Administration. Because the Ability Profiler is connected to O*Net, it is linked to more than 800 occupations within the U.S. This particular instrument is not intended for personnel selection.

Forms E and F were initially intended to be used as a cognitive ability battery for personnel selection, vocational counseling, and occupational exploration. As with the previous forms, Forms E and F were created to be used in a wide range of occupations (i.e., for nearly every job/occupation in the U.S.). The subtests were developed using work-neutral item content and consequently are not specific to any one occupational group. Because the battery was to be used across a wide range of occupations, in both government and private organizations, it had to be easily scored and interpreted by test administrators.

According to Hausdorf, LeBlanc, and Chawla (2003), although the GATB does predict future job
performance, it has demonstrated differential prediction and adverse impact against African
Americans (a 1 SD mean difference) in the U.S. (Hartigan & Wigdor, 1989; Sackett & Wilk, 1994;
Wigdor & Sackett, 1993).

Note: Although Form Matching was eventually dropped, the final decision came late in the project
during the development of Forms E and F. Consequently, it was included in many of the development
steps described in this chapter. Also, Tool Matching was eventually renamed Object Matching
(Mellon, Daggett, & MacManus, 1996).

Targeted Populations

The populations targeted by GATB Forms E and F were individuals at least 16 years of age and/or working adults in the U.S. who would be working in nearly any occupation or job. Being intended for such a wide range of occupations required that the GATB be written at a relatively low reading level of grade six. As such, the cognitive test battery could be administered to a wide range of individuals, from semi-skilled (perhaps even unskilled) occupations through managerial and even healthcare occupations.

54
Targeted Jobs / Occupations and Spread of Uses

The GATB Forms E and F were designed to be used for personnel selection, vocational counseling, and occupational exploration with nearly every job/occupation in the United States. It was designed for working adults in the U.S. and was written at approximately a grade-six reading level.

Administrative Detail

Table 30 shows administrative details for Forms E and F of the GATB.

Table 30. Administrative detail for GATB Forms E & F.*

Subtest                        # Items              Time
1. Name Comparison             90                   6 minutes
2. Computation                 40                   6 minutes
3. Three-Dimensional Space     20                   8 minutes
4. Vocabulary                  19                   8 minutes
5. Object Matching             42                   5 minutes
6. Arithmetic Reasoning        18                   20 minutes
7. Mark Making                 130                  60 seconds
8. Place                       48 pegs              15 seconds per peg
9. Turn                        48 pegs              15 seconds per peg
10. Assemble                   1 trial, 50 rivets   90 seconds
11. Disassemble                1 trial, 50 rivets   60 seconds

Scoring rules and methods of delivery:

Subtests 1-6: Number-correct raw score; percentile scores and stanines reported for norm groups. Paper-pencil; proctored. For subtests 7-11, materials/apparatus are supplied for the test; proctored.

Subtest 7 (Mark Making): The administrator and examinee count and record the number of marks.

Subtest 8 (Place): Record the total number of attempts made by the examinee. The number of attempts is equal to the number of pegs moved, regardless of whether they have been turned in the proper direction or not.

Subtest 9 (Turn): Record the total number of attempts made by the examinee. The number of attempts is equal to the number of rivets moved, regardless of whether or not they were properly assembled or inserted. The Part 10 score can be determined by counting (a) the number of empty holes in the upper part of the board; or (b) the number of rivets inserted in the lower part of the board plus the number of rivets dropped; or (c) the number of rivets remaining in the holes in the upper part of the board subtracted from the total number of rivets (50). Two of these methods should be used in scoring, one as a check on the other.

Subtest 10 (Assemble): Record the total number of attempts made by the examinee. The number of attempts is equal to the number of rivets moved, regardless of whether or not they were properly disassembled or reinserted. The Part 11 score can be determined by counting (a) the number of empty holes in the lower part of the board; or (b) the number of rivets inserted in the upper part of the board plus the number of rivets dropped; or (c) the number of rivets remaining in the holes in the lower part of the board subtracted from the total number of rivets (50). Any two of these methods should be used in scoring, one being used as a check on the other.

Subtest 11 (Disassemble): Same as above for Assemble.

*Form Matching was dropped for GATB Forms E & F.

Number and Type of Items

The number of items for each of the 11 subtests is shown in Table 30. The GATB Forms E and F were designed to be equivalent forms, containing the same number and types of items. The items are composed of work-neutral content. The power tests are a combination of forced-choice and multiple-choice items, whereas the speeded tests are psychomotor tasks that assess an examinee's manual and finger dexterity. The creation of GATB Forms E and F resulted in fewer items for several of the subtests compared to the earlier Forms A through D. Changes were also made to the order of the power subtests (i.e., subtests 1-6). The major differences between Forms E and F and Forms A through D were that (a) the total number of items across the subtests was reduced from 434 items to 224 items, (b) the total testing time for the battery increased from 42 minutes to 51 minutes, and (c) Form Matching was eventually dropped from the battery, leaving 11 subtests instead of 12.

Time Limits

Overall administration time is 2.5 hours, including administration and instruction time; the testing time for all 11 subtests is 51 minutes. The seven paper-pencil subtests (i.e., subtests 1-7) can be administered in approximately 1.5 to 2 hours, as can the six non-psychomotor subtests (i.e., subtests 1 through 6). If individuals are not applying for, or seeking career guidance about, occupations that require manual or finger dexterity, then subtests 1-6 would be a sufficient combination of subtests.

Type of Scores

The original GATB Forms A through D used number-correct scoring, in which the final score is calculated by adding the number of questions answered correctly for each subtest. Each of subtests 1 through 6 consists of multiple choice items. For these forms, there were no penalties for incorrect answers. One issue with this type of scoring was that examinees who guessed, or even responded randomly but rapidly, could increase their total scores on the speeded tests. Procedures were taken to reduce the speededness of the power tests in order to reduce the influence of such test-taking strategies.

For the three remaining speeded tests (i.e., Computation, Object Matching, and Name Comparison), a conventional scoring formula was chosen that imposes a penalty for incorrect responses. The premise of the formula is that the penalty is based upon the total number of response alternatives for each subtest item. For instance, if there are k alternatives and an examinee responds randomly, then there will be k - 1 incorrect responses for every correct response. The formula therefore imposes a penalty for incorrect responses that cancels out the number of correct responses expected purely by chance from an examinee's random responses to subtest items. The basic form of this formula is R - W / (k - 1), where R is the number of correct responses, W is the number of incorrect responses, and k is the number of options for each subtest item. The specific formulas for each of the subtests are as follows:

A. Computation: R - W / 4 (for each incorrect response, there is a reduction of 1/4 of a point)

B. Object Matching: R - W / 3 (for each incorrect response, there is a reduction of 1/3 of a point)

C. Name Comparison: R - W (for each incorrect response, there is a reduction of 1 point)

Number-correct scoring remained for the power tests of the GATB Forms E and F. Alternative scoring methods were considered, but the number-correct approach was selected, in part because it was thought to be easier for test takers to understand.
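The R - W / (k - 1) correction described above can be expressed as a short function. The option counts shown here are inferred from the penalties listed (e.g., a 1/4-point penalty implies five response options), not stated directly in the source:

```python
def formula_score(num_correct, num_wrong, k):
    """Corrected raw score R - W/(k - 1): the penalty cancels the number of
    correct answers expected from blind guessing among k options."""
    return num_correct - num_wrong / (k - 1)

# Penalties from the GATB scoring formulas (option counts inferred):
#   Computation:     R - W/4  -> k = 5
#   Object Matching: R - W/3  -> k = 4
#   Name Comparison: R - W    -> k = 2
print(formula_score(30, 8, 5))  # 28.0
```

With this rule, an examinee who answers items purely at random earns an expected corrected score of zero, which removes the incentive for rapid random responding.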

Method of Delivery

The GATB Forms E and F are administered via a paper-pencil test booklet, as well as manual and finger dexterity boards, proctored by a test administrator. Subtests 1 through 6 are answered by examinees in a booklet, whereas subtests 7 through 11 use materials/apparatuses that are supplied and overseen by the test administrator. The power subtests (i.e., subtests 1 through 6) can be administered either to a single individual or in a group setting, and it is recommended that the power tests be administered prior to the five speeded tests (i.e., subtests 7 through 11: Mark Making, Place, Turn, Assemble, and Disassemble, respectively). There appears to be no online method of test administration for the power subtests.

57
Cost
Table 31. Cost of GATB test materials and resources.
Item Price
Manuals
Administration & Scoring $56.75
GATB Application and Interpretation Manual $94.75
Test Booklets
Booklet 1 Form A (Parts 1 - 4) $99.00
Booklet 2 Form A (Parts 5 - 7) $99.00
Booklet 1 Form B (Parts 1 - 4) $99.00
Booklet 2 Form B (Parts 5 - 7) $99.00
Answer Sheets (pkg. 35)
Form A (Parts 1 – 7) $34.75
Form B (Parts 1 – 7) $34.75
Mark Making Sheets (Part 8) $25.00
Dexterity Boards
Pegboard (Manual Dexterity) $254.75
Finger Dexterity Board $179.50
Scoring Masks
Form A (Parts 1 – 7) $49.70
Form B (Parts 1 – 7) $49.70
Recording Sheets (pkg. 35)
Self-estimate Sheets $25.00
Result Sheets $53.50
Additional Resources
Examination Kit $175.00
GATB Score Conversion/Reporting Software $359.50
Interpretation Aid Charts $12.00
*Costs are from Nelson, which publishes GATB Forms A and B in Canada.

Construct - Content Information

Intended Constructs

GATB Forms E and F are comprised of subtests similar to those in many other commercially
available cognitive ability tests. The GATB Forms E and F include two verbal, two quantitative (one of
which is based upon reasoning), one object matching, one spatial, and three psychomotor subtests
(the finger and manual dexterity measures each comprise two tests using the same apparatus). What sets
the GATB Forms E and F cognitive battery apart from most other commercially available cognitive
batteries is its inclusion of psychomotor ability subtests, including motor coordination, finger
dexterity, and manual dexterity. Like most commercially available cognitive batteries, it measures a
range of abilities that are related to a wide range of occupations in the U.S. (nearly every occupation/job
in the U.S.).

The developers of the GATB Forms E and F intended to revise the previous GATB Forms A through
D, which had been reviewed and judged to need revision. The revision objectives were (a) to reduce
speededness and susceptibility to coaching, (b) to establish more appropriate scoring procedures,
(c) to develop better instructions for the examinee, (d) to develop test items free from bias, (e) to
assemble parallel test forms, (f) to improve the overall “look-and-feel” of the test and accompanying
resources, and (g) to revise the answer sheets and other supporting documents to be consistent with
changes to test format. Also, Forms E and F do not include the Form Matching subtest.

Abilities measured by the GATB subtests:

Verbal Ability (VA) is the ability to understand the meaning of words and use them effectively in
communication when listening, speaking, or writing.

Arithmetic Reasoning (AR) is the ability to use math skills and logical thinking to solve problems in
everyday situations.

Computation (CM) is the ability to apply arithmetic operations of addition, subtraction, multiplication,
and division to solve everyday problems involving numbers.

Spatial Ability (SA) is the ability to form and manipulate mental images of 3-dimensional objects.

Form Perception (FP) is the ability to quickly and accurately see details in objects, pictures, or
drawings.

Clerical Perception (CP) is the ability to quickly and accurately see differences in detail in printed
material.

Motor Coordination (MC) is the ability to quickly and accurately coordinate hand or finger motion
when making precise hand movements.

Manual Dexterity (MD) is the ability to quickly and accurately move hands easily and skillfully.

Finger Dexterity (FD) is the ability to move fingers skillfully and easily.

Table 32 shows the linkages between the target abilities and the subtests.

Table 32. Abilities measured by GATB subtests.

Subtest                        Ability measured
1. Name Comparison             Clerical Perception (CP)
2. Computation                 Computation (CM)
3. Three-Dimensional Space     Spatial Ability (SA)
4. Vocabulary                  Verbal Ability (VA)
5. Object Matching             Form Perception (FP)
6. Arithmetic Reasoning        Arithmetic Reasoning (AR)
7. Mark Making                 Motor Coordination (MC)
8. Place                       Manual Dexterity (MD)
9. Turn                        Manual Dexterity (MD)
10. Assemble                   Finger Dexterity (FD)
11. Disassemble                Finger Dexterity (FD)

Item Content

The GATB Forms E and F consist of 11 subtests that are intended to measure nine abilities
important for a wide range of occupations. Because the GATB Forms E and F were intended to be
used with nearly every occupation in the U.S., the item content was developed at a lower level of
difficulty than many commercially available cognitive batteries, as it may be used with occupations
ranging from semi-skilled (and perhaps unskilled) through professional and healthcare practitioner.
The reading level of the GATB Forms E and F subtests was set to a relatively low level of grade six.
While the battery is intended for personnel selection and occupational guidance, all item content is
work-neutral. Subtests 1 through 6 are power tests composed of multiple-choice questions. Subtests
7 through 11 are speeded tests in which the test administrator supplies and oversees the materials
or apparatuses used to measure the specific abilities.

Description of Subtests and Sample Items

1. Name Comparison
a. Consists of determining whether the two names are the same or different.


2. Computation
a. Consists of mathematical exercises requiring addition, subtraction, multiplication, or
division of whole numbers.


3. Three-Dimensional Space
a. Consists of determining which one of four three-dimensional figures can be made by
bending and/or rolling a flat, two-dimensional form.


4. Vocabulary
a. Consists of indicating which two words out of four have either the same or opposite
meanings.


5. Object Matching
a. Consists of identifying the one drawing out of four that is the exact duplicate of the
figure presented in the question stem.


6. Arithmetic Reasoning
a. Consists of mathematical word problems requiring addition, subtraction,
multiplication, or division of whole numbers, fractions, and percentages.


7. Mark Making
a. Consists of using the dominant hand to make three lines within a square.


8. Place
a. Consists of using both hands to move pegs, two at a time, from the upper part of the
board to the lower part.

9. Turn

a. Consists of using the dominant hand to turn pegs over and insert them back into the
board.


10. Assemble
a. Consists of using both hands to put a washer on a rivet and move the assembled
piece from one part of the board to another.


11. Disassemble
a. Consists of using both hands to remove a washer from a rivet and put the
disassembled pieces into different places on the board.


Construct Validity Evidence

The developers of the DAT for PCA reported correlations between the DAT for PCA subtests and the
GATB. The pattern of correlations in Table 33 below provides support for the GATB: (1) the six GATB
cognitive factors are highly related to the DAT battery, (2) the GATB’s general intelligence factor was
highly related to all of the DAT subtests except Clerical Speed and Accuracy, (3) each of the GATB
factors has its highest correlation with the appropriate DAT subtests, and (4) the GATB perceptual
and motor tests correlated relatively highly with the DAT’s Clerical Speed and Accuracy subtest.
Table 33 shows the correlations between the GATB and DAT.

Table 33. Correlations between DAT and GATB subtests.

        VR   NA  VR+NA  AR  CSA  MR  SR  SP  LU   G   V   N   S   P
NA     .73
VR+NA   -    -
AR     .63  .70  .71
CSA    .41  .44  .46  .35
MR     .68  .63  .71  .69  .27
SR     .68  .67  .72  .70  .38  .72
SP     .68  .57  .68  .44  .40  .35  .40
LU     .81  .66  .80  .51  .44  .55  .53  .75
G      .78  .72  .81  .64  .48  .62  .64  .70  .80
V      .76  .64  .76  .58  .46  .57  .55  .68  .81  .94
N      .52  .62  .61  .43  .48  .29  .33  .64  .58  .66  .54
S      .53  .58  .59  .63  .42  .58  .68  .40  .45  .70  .57  .41
P      .19  .23  .22  .24  .36  .19  .27  .24  .19  .37  .28  .35  .49
Q      .40  .42  .44  .39  .61  .21  .29  .50  .39  .57  .51  .62  .46  .49

Table 34 below presents the results of Hunter’s (1983) confirmatory factor analysis of the correlations
among the aptitudes measured by the GATB battery. The results of the confirmatory factor analysis
show that the aptitudes break into three clusters: (a) cognitive, (b) perceptual, and (c) psychomotor.
This is consistent with studies computing correlations between validity coefficients for aptitudes
measuring the same abilities.

Table 34. Confirmatory factor analysis between factors and aptitudes.

                            Factors
Aptitude                 Cognitive (VN)  Perceptual (PQ)  Psychomotor (KFM)
Intelligence (G)               -               -                 -
Verbal Aptitude (V)           .82             .68               .32
Numerical Aptitude (N)        .82             .77               .42
Spatial Aptitude (S)          .59             .61               .35
Form Perception (P)           .64             .81               .66
Clerical Perception (Q)       .78             .81               .54
Motor Coordination (K)        .48             .60               .64
Finger Dexterity (F)          .25             .46               .67
Manual Dexterity (M)          .19             .45               .72

Item and Test Development

Test Development Procedures

Note: Information below regarding item review and revisions is taken directly from Mellon, Daggett,
MacManus, and Moritsch (1996), which provides an extensive report of the development of GATB
Forms E and F.

Item Writing and Editorial Review and Screening

For the development of the GATB Forms E and F, considerable effort was devoted to developing new
items. Many more items were originally composed than appeared in the final versions. The developers
analyzed items from previous forms of the GATB and sorted them into categories based upon item
difficulty. Detailed information on the specifications and item types/content categories for each of
subtests 1 through 7 is reported below.

A literature search was first performed to determine the proper procedures for conducting item
reviews and selecting participants for the review process. The item review instruments to be
incorporated into the process were then identified, along with information regarding three issues:
bias guidelines, procedural issues, and rating questions.

Preliminary Review. Draft versions of item sensitivity review questions, instructions, and an answer
form were sent to Assessment and Research Development (ARD) centers for review. Based on the
comments, Assessment and Research Development of the Pacific (ARDP) staff revised draft versions
of the sensitivity review materials and sent them to Assessment Research and Development Centers
(ARDCs) for further review. The only revision was a minor change in the answer form.

Pilot Test. A pilot test was conducted in-house with three Cooperative Personnel Services
(CPS) staff members, enabling individuals who were not involved in the ARDP test research program
to provide input to the review process. The results led to a number of modifications in procedures,
instructions, and documents that would be used for the item review.

Item Review Materials. Nine documents were used in the item review process: (1) a list of the criteria
to select panel members, (2) a confidentiality agreement, (3) a description of the
GATB tests and aptitudes, (4) written instructions for panel members, (5) the administrator’s version
of the written instructions for panel members, (6) a list of characteristics of unbiased test items, (7) a
list of the review questions with explanations, (8) an answer form, and (9) an answer form
supplement.

Panel Member Characteristics. Seven panel members participated in the review. The panel
included two African Americans, three Hispanics, and two whites. Three members were male and four
female. Three members were personnel analysts, two were university professors in counselor
education, one was a personnel consultant, and one was a postdoctoral fellow in economics.

Procedures. At an orientation meeting held at each of the three participating ARDCs, confidentiality
agreements were signed, GATB items and instructions were given to panel members, and several
items in each test were reviewed and discussed. Panel members reviewed the remaining items at
their convenience. After all items were reviewed, a follow-up meeting was held at each center to
resolve any problems and to discuss the review process.

A. Name Comparison. The 400 Name Comparison items were developed to be parallel to Form
A items and representative in terms of gender and ethnicity. The number of items with names
that were the same was equal to the number of items with different names. Item sources
included directories, dictionaries, and item developer creativity. Analyses were then
performed to develop preliminary estimates of item difficulty. Based on these analyses, the
number of characters in the left-hand column of the two-column format used for this test was
selected as the item difficulty measure. The 200 items for each form were divided into four 50-
item quarters of approximately equal estimated overall difficulty. The item order was then
randomized within each quarter.
a. Review and Screening
i. Comments focused on racial, ethnic, and gender stereotyping and
representation. Specific concerns included the lack of female and minority
businesses, and the need for more females in nontraditional professions,
jobs, and businesses.
b. Content Revision
i. The revisions addressed the racial, ethnic, and gender stereotyping and
representation criticisms. Guidelines based on the 1990 U.S. Census were
used to increase racial/ethnic and gender representation. Stereotyping was
addressed by including items with minorities and females in nontraditional
occupations and businesses; more professional occupations and businesses
were included. Fewer items with Germanic names were used. Format
changes included separating the items into blocks of five, eliminating
horizontal lines, and increasing the horizontal and vertical space within and
between items. Finally, the instructions were reworded to increase clarity;
bold and italicized types were used for emphasis.
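The quarter-based assembly procedure described above (dividing items into quarters of approximately equal overall difficulty and then randomizing item order within each quarter) can be sketched as follows. This is our own illustration of one way to realize that procedure, not the developers' actual method:

```python
import random

def assemble_form(items, n_quarters=4, seed=0):
    """Deal difficulty-ranked items round-robin into quarters so that each
    quarter receives a similar spread of easy and hard items, then shuffle
    the item order within each quarter."""
    rng = random.Random(seed)
    ranked = sorted(items, key=lambda item: item["difficulty"])
    quarters = [ranked[i::n_quarters] for i in range(n_quarters)]
    for quarter in quarters:
        rng.shuffle(quarter)
    return quarters

# A 200-item pool with illustrative difficulty estimates yields
# four 50-item quarters of approximately equal overall difficulty.
pool = [{"id": i, "difficulty": i % 97} for i in range(200)]
quarters = assemble_form(pool)
```

Round-robin dealing after ranking is one simple way to keep the average difficulty of the quarters close; the actual GATB assembly also balanced content and correct-response positions.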

B. Computation. The 136 Computation items were developed to be parallel to Forms A-D. The
original items were developed and reviewed to evaluate difficulty. The number of digits across
numbers within each type of operation was used as the item difficulty measure. The 68 items
for each form were divided into four 17-item quarters of equal estimated overall difficulty.
Type of arithmetic operation and response options were balanced within each quarter. A low-
difficulty item was assigned to the first position within each quarter with the remaining items
ordered randomly.
a. Review and Screening
i. Comments primarily dealt with item characteristics. Specific concerns
included difficult and time-consuming problems that might be skipped by
testwise applicants, poor distractors, and unclear instructions.
b. Content Revision
i. Distractors were revised to make them more plausible based on five error
types. Minor format changes included adding commas to numbers with at
least four digits and placing the operation sign within the item. Finally, the
instructions were reworded slightly to increase clarity, and bold and italicized
types were used for emphasis.

C. Three-Dimensional Space. The 130 Three-Dimensional Space items were developed to be
similar in content to prior forms. The number of folds was used as a measure of item difficulty;
it had six levels. Newly developed items were grouped according to the number of folds so
that an equal number of items would be developed for each of the six difficulty levels. Items
were then drawn on a computer, using the CADD-3 software package. Items were continually
reviewed for clarity and correctness, and shading was added. Completed items were
transferred to Mylar paper and reduced in size photographically, and then plates were made
for printing. Items were reviewed again and revised when necessary. Items were then
assigned to forms on the basis of difficulty, and response options were checked and tallied.
Option positions were changed as necessary. The items were rephotographed and printed.

The ARDP used the CorelDRAW! 4 graphics package (Corel Corporation, 1993) to redraw all
of the items to make them consistent in appearance. Camera-ready copies of the reformatted
items were prepared and sent to a graphic artist for proofing. Some of the items were later
revised to correct the problems identified by the graphic artist. Three difficulty levels were
identified based on the number of folds and/or rolls made in each item. These difficulty values
were then used to form three 16-item quartiles and one 17-item quartile of approximately
equal estimated overall difficulty within each form. Within each quartile a low-difficulty item
was assigned to the first position with the order of the remaining items randomized. The
correct response option frequencies were balanced within each quartile.
a. Review and Screening
i. Comments concerned possible gender bias and item characteristics.
Comments included the presence of male-oriented items and abstract items
that might be unfamiliar to females, difficult and time-consuming items that
could be skipped by testwise applicants, gender-biased instructions, and
overly complicated items.
b. Content Revision
i. Individual items were revised when needed to increase clarity. Revisions
were reviewed by a graphics expert familiar with the test format and the
drawing software to ensure that the items were free of errors. Instructions
were reworded slightly to increase clarity and eliminate possible gender bias;
bold and italicized types were used for emphasis.

D. Vocabulary. The 160 Vocabulary items were developed to be parallel to Form B. Item review
also focused on word difficulty but used a different approach from previous GATB
development efforts. Specifically, The Living Word Vocabulary (Dale & O’Rourke, 1981)
provided estimates of item difficulty. This reference assigns a grade level to each word
meaning. The assigned grade level is based on the responses of students who completed
vocabulary tests during the period of 1954-1979. When multiple word meanings were
reported for a given word, the average grade level was used. Higher grade levels indicated
greater difficulty. The mean of the reported grade levels for the four words that made up each
item was used to estimate item difficulty. Four difficulty level categories were formed. These
categories were used to prepare four 20-item quartiles of equal estimated overall difficulty for
each form. For each quartile, the two items with the lowest estimated difficulty appeared in the
first two positions with the order of the 18 remaining items randomized. The correct response
option frequency distributions were balanced within quartiles and forms.
a. Review and Screening
i. Comments concerned high reading grade level, overly difficult words; words
with different meanings for different groups; and inclusion of foreign-language
words and technical, biological, and scientific terms.
b. Content Revision
i. Words were replaced on the basis of the item review panel member
comments and on an analysis of word difficulty in Dale and O’Rourke (1981).
Items were modified as needed to ensure that each item’s level of word
difficulty was appropriate, word forms within items were identical, and the
same type of correct response (i.e., synonym or antonym) was maintained
within each item. The item format was changed from horizontal to vertical
ordering of words. Finally, the instructions were reworked (e.g., bold and
italicized types were used to emphasize important points, and a statement
was added stressing that all choices should be considered before selecting
an answer).

E. Object Matching. The 163 original Object Matching items were developed to be parallel to
Forms A-D. The ARDP used the number of shaded areas in the four response alternatives for
each item to estimate difficulty level. Difficulty level, content considerations, and location of
the correct response were used to form four 20-item quartiles of similar overall difficulty for
each form. (Three items were deleted.) The item order was randomized within each quartile.

A surplus item was then added to each quartile to form three seven-item pages that could be
shifted to meet the requirements of the research design.
a. Review and Screening
i. Comments focused mainly on possible gender bias due to differences in
familiarity and the presence of male-oriented items. However, concerns were
also expressed that items with electrical and mechanical components might
cause problems for minorities due to lack of familiarity and opportunity to
learn. Other comments concerned clarity of instructions and positioning of the
response letters for the item alternatives.

b. Content Revision
i. Item revisions included eliminating inconsequential differences among item
responses, eliminating duplicate responses, and refining responses (e.g.,
removing extraneous matter, drawing sharper lines, eliminating broken lines).
Finally, the instructions were reworded slightly to increase clarity; bold and
italicized type was used for emphasis, and the test name was changed from
Tool Matching to Object Matching. (Future forms will include more generic
items even though the results from item analyses indicated that female
scores are slightly higher than male scores on the current items.)

F. Arithmetic Reasoning. The 66 Arithmetic Reasoning items were developed to be parallel to
Form A. New situations, contemporary monetary values, gender representation, exclusion of
extraneous information, and a sixth-grade reading level were additional considerations in item
development. The ARDP reviewed and revised the items so they conformed more closely to
the guidelines for development. Item difficulty was estimated by the number of operations
needed to solve the problem, the type(s) of operations, and the number of digits included in
the terms used in the operation(s). One of the two least difficult items was assigned to the first
item position in Form E and the other item assigned to Form F. The remaining 64 items were
then assigned to four eight-item quartiles for each form on the basis of difficulty, type (s) of
operation(s), correct response key, and content. The items in each quartile were ordered from
least to most difficult with the item order then randomized within each quartile.
a. Review and Screening
i. Most comments were directed toward two areas: (1) racial, ethnic, and
gender representation, and (2) gender occupational and activity stereotyping.
Other comments concerned time-consuming items that might be skipped by
testwise applicants, confusing and incomplete instructions, the presence of
items that were overly complicated or involved too many steps, and some
groups not having the opportunity to learn how to perform the operations
needed to answer the complex items.
b. Content Revision
i. Revisions involved four areas: making minor item format modifications,
eliminating gender stereotyping, making the distractors more plausible, and
increasing racial, ethnic, and gender representation. The instructions were
reworded slightly to increase clarity; bold and italicized types were used for
emphasis.

G. Form Matching. The 200 Form Matching items were developed to be parallel to Forms A-D
items in terms of content and parallel to Form A item size and arrangement. Eight 25-item
blocks were developed by modifying each of the eight blocks of items in Forms A-D. The
number of response options for each item was reduced from 10 to five (Form Matching was
later removed for the final versions of Forms E and F).
a. Review and Screening
i. Comments included a possible practice effect for the test and unclear
instructions because of reading level. Comments that were directed toward
specific items included linear illustrations being perceived as “hostile,” minute
differences among shapes, and possible confusion due to shape similarity
and location.
b. Content Revision
i. Changes included enlarging figures to increase clarity, repositioning items to
equalize space among items in the lower blocks, and revising an item family
to make it less similar to another item family. The number of response
options was reduced from 10 to 5. Finally the instructions were reworded
slightly to increase clarity; bold and italicized types were used for emphasis.

Item Tryout and Statistical Screening. The item pretest and analysis had two goals: (1) conducting
an item analysis to obtain preliminary difficulty and discrimination indices, and (2) obtaining a
quantitative estimate of ethnic and gender performance differences for each item. The sample
comprised 9,327 applicants from USES local offices in the five geographic regions represented within
the ARDP. Data were obtained by administering 16 test booklets, each comprising three speeded
tests and one power test; each sample member completed one test booklet. Classical
test theory item analyses were performed for the speeded test items. Item selection criteria included
difficulty, discrimination, and content considerations. IRT procedures were used for the power test
items. The analyses included dimensionality, position effects, item and test fairness, and test
information graphs. Item DIF analyses were also performed with Mantel-Haenszel procedures. IRT
procedures were used for test-level DIF analyses.
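A classical test theory item analysis of the kind described above computes, for each item, a difficulty index (proportion correct) and a discrimination index (here, a point-biserial correlation with the corrected total score). The sketch below is our own minimal illustration, not the developers' software:

```python
from statistics import mean, pstdev

def point_biserial(item, score):
    """Point-biserial correlation between a 0/1 item vector and a score."""
    m1 = mean(s for i, s in zip(item, score) if i == 1)
    m0 = mean(s for i, s in zip(item, score) if i == 0)
    p = mean(item)
    return (m1 - m0) * (p * (1 - p)) ** 0.5 / pstdev(score)

def item_analysis(responses):
    """Return (difficulty, discrimination) per item. `responses` is a
    list of per-person 0/1 vectors; discrimination correlates each item
    with the total score excluding that item (corrected item-total)."""
    totals = [sum(person) for person in responses]
    stats = []
    for j in range(len(responses[0])):
        item = [person[j] for person in responses]
        rest = [t - i for t, i in zip(totals, item)]
        stats.append((mean(item), point_biserial(item, rest)))
    return stats
```

Items with very high or very low difficulty, or low discrimination, would be flagged for content review or removal, consistent with the selection criteria named above.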

Construction of the Final Version of Forms E and F. After items were screened and calibrated, a
final set of items was selected for each Form E and Form F test. Items were selected to yield forms as
parallel to each other as possible with respect to content coverage, difficulty, and test information. The
forms were also balanced on subgroup difference statistics so that no one form provided any relative
disadvantage to females, African Americans, or Hispanics. Insofar as possible, the power tests were
also designed to be similar with respect to difficulty and information to Form A, after adjusting for
differences in test lengths.

Psychometric Model

As noted above, classical test theory item analyses were performed for the speeded test items, with
item selection criteria including difficulty, discrimination, and content considerations, and Item
Response Theory (IRT) procedures were used for the power test items. The IRT approach to fairness
analysis involves estimating separate item characteristic curves (ICCs) for the two groups being
compared. The developers evaluated the dimensionality of the power tests, performed computer
analyses to estimate IRT item parameters, and conducted a preliminary selection of items. They used
this information, in conjunction with their own analyses of differential item functioning (DIF) using
Mantel-Haenszel procedures at the item level and IRT procedures at the test level, to select the final
items for the new forms.
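An item characteristic curve of the kind estimated separately for each group in IRT-based DIF analysis can be illustrated with the three-parameter logistic (3PL) model; the parameter values below are invented for illustration only:

```python
import math

def icc_3pl(theta, a, b, c):
    """3PL item characteristic curve: probability of a correct response
    given ability theta, discrimination a, difficulty b, and
    pseudo-guessing c (with the conventional 1.7 scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

# DIF sketch: the same item calibrated separately in two groups.
# If the estimated curves differ (here, a higher difficulty b in the
# focal group), the item functions differently across groups.
p_reference = icc_3pl(0.0, a=1.2, b=0.0, c=0.2)
p_focal = icc_3pl(0.0, a=1.2, b=0.5, c=0.2)
```

In a DIF analysis, the area between the two groups' curves (or a statistical test on the parameter differences) quantifies how differently the item functions.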

Equating Multiple Forms

In order to equate the new forms, Forms E and F, with Form A, a study was implemented. The
developers collected a nationwide sample of data for 8,795 individuals representative of applicant
populations. The technical requirements and foundation for the procedures reported below are
presented in Segall and Monzon (1995).

Data Collection Design

Three samples of data were collected for the GATB Forms E and F equating study: (a) the first
describes the data collection design for the independent-groups sample, which was used to equate
the new forms with the old GATB forms, (b) the second sample was a repeated-measures sample which
was used for comparing the reliability and construct validity of the new and old forms, as well as the
equating analysis, and (c) the third data collection design was for the psychomotor sample which was
used for examining the need for composite equatings and also to examine construct validity of the
psychomotor tests.

For the independent-groups sample, 5,892 participants were randomly assigned to one of three forms
(i.e., A, E, or F). Because the old form (Form A) and the new forms (Forms E and F) involve different
test orders, time limits, and instructions, the different versions could not be administered to a single
group of participants at the same time. Participants at each testing location were therefore randomly
assigned to a test version as a group, and the groups had to be physically separated during
administration of the forms.

The repeated-measures sample was administered two forms of the GATB (Forms A, B, E and F), and
this sample was primarily used for examining the reliability and construct validity of the GATB, and
also to supplement the equating data. Each of the participants was randomly assigned to one of eight
conditions. Table 35 shows the conditions to which participants were assigned.

Table 35. Repeated-measures design and sample sizes.


Second Test
First Test A B E F
A 1 (411) 3 (236)
B 2 (432) 5 (209)
E 6 (215) 7 (446)
F 4 (216) 8 (446)

For the psychomotor subtests, 538 participants were sampled and received each of the five
psychomotor subtests (i.e., subtests 7 through 11) and also the non-psychomotor subtests (i.e.,
subtests 1 through 6). Participants were randomly assigned to one of two groups, each receiving (a)
Form A (non-psychomotor), (b) Form A (psychomotor), and (c) Form F (non-psychomotor). The order
of presentation of test forms was counterbalanced across the two groups. For example, one group
received Form A (non-psychomotor) and Form A (psychomotor) in the morning, and Form F (non-
psychomotor) in the afternoon. The other group would have received the same battery, but the order
of the non-psychomotor tests of Forms A and F would have been reversed. Table 35 above shows
the order and forms of tests administered.

An evaluation of the random equivalence of selected groups within each of the three samples was
also conducted, because the random equivalence of these groups is a key assumption made in the
equating, reliability, and validity analyses. These tests were conducted by gender, race, age, and
education. Each resulted in non-significant findings, which are consistent with the expectation
based on random assignment of examinees to conditions and support the assumption of equivalent
groups made in the equating, reliability, and validity analyses. This also provides assurance that the
assignment procedures worked, producing groups that are randomly equivalent with respect to
demographic characteristics.

Reliability Analysis

Because the new test forms have fewer items and longer time limits, a reliability analysis was needed
to determine the precision with which the new forms, Forms E and F, measure their purported
abilities. Four groups were used from the repeated-measures sample: both new forms were
administered to a large sample of test takers, as were both Forms A and B. Composites of subtest
scores were formed for the several ability measures shown in Table 36, and the alternate-forms
reliabilities reported in Table 36 are the reliabilities of these composite scores. Fisher’s z
transformation, as described in Cohen and Cohen (1983), was used to test the significance of the
differences between the alternate-form correlations of the new and old GATB forms. The alternate-
form reliabilities of the new GATB Forms E and F are generally as high as, or higher than, those of
the old GATB Forms A and B. This is notable because the lengths of the three power tests were
decreased; the increase in testing time, however, may have added to the reliability of these power
tests, offsetting the detrimental effects of shortening test lengths.

Table 36. Alternate forms reliability estimates for Forms A and B and for Forms E and F.

                            Reliability                         Significance Test
Ability Composite           RE,F (N = 870)   RA,B (N = 820)     z        p
General Learning Ability        .908             .886         2.284    .011
Verbal Aptitude                 .850             .858         -.580    .281
Numerical Aptitude              .876             .884         -.704    .241
Spatial Aptitude                .832             .805         1.648    .050
Form Perception                 .823             .824         -.049    .480
Clerical Perception             .778             .755         1.145    .126
* RE,F = new forms; RA,B = old forms
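The significance tests in Table 36 follow the standard procedure for comparing two independent correlations via Fisher's r-to-z transformation. A minimal sketch follows; the function name is ours, and small rounding differences from the table's reported z values are expected:

```python
import math

def fisher_z_test(r1, n1, r2, n2):
    """Test the difference between two independent correlations:
    transform each r to z' = atanh(r), divide the difference by its
    standard error sqrt(1/(n1-3) + 1/(n2-3)), and return the normal
    deviate with its one-tailed p-value."""
    z = (math.atanh(r1) - math.atanh(r2)) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    p = 0.5 * math.erfc(z / math.sqrt(2))  # one-tailed upper-tail p
    return z, p

# General Learning Ability row of Table 36:
z, p = fisher_z_test(0.908, 870, 0.886, 820)
```

Applied to the reliabilities for General Learning Ability, this yields a z near the 2.28 reported in the table, with a one-tailed p of roughly .01.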

Overall Form Linking Study Results. Equating of the new GATB forms to the old forms proved
successful. Despite the changes made in the new test forms, the evidence suggests that there is
sufficient similarity to obviate the need for separate composite equating tables for the
nonpsychomotor composites. Average subgroup performance levels are similar across the old and
new forms, and reliabilities of the new GATB forms are generally as high as or higher than those of
the old forms. Construct validity analyses of the old and new forms suggest that the GATB validity
data can continue to be used for the new forms.

Item Banking Approaches

While IRT was used in the development of the power tests, no information was reported about how
item banking approaches were used in the development or maintenance of the GATB Forms E and F.
It is likely that new items were stored in an item bank, but the methods by which Forms E and F, as
well as earlier Forms A and B, were administered in a fixed form mode indicate that no feature of the
administration of these fixed forms relied on the ongoing use of banked item characteristics.

Criterion Validity

Validity across Job Families

By the early 1980’s over 500 criterion validity studies had been conducted with GATB subtests. To
organize comprehensive meta-analyses of these studies (Hartigan & Wigdor, 1989; Hunter, 1983),
three composites of GATB subtests were created: (a) cognitive ability, (b) perceptual ability, and (c)
psychomotor ability. In addition, broad job families were identified to organize meta-analyses of these
studies to evaluate the generalizability of GATB criterion validity. The meta-analyses included 515
validation studies that were performed for the U.S. Employment Service. Of the 515 validation
studies, 425 used a criterion of job proficiency while the remaining 90 used a criterion of training
success.

Tables 37a through 37c show the variation in validities for each of the three ability composites in
three ways: the distribution of observed validity coefficients, the distribution of validities corrected for
sampling error, and the distribution of validity coefficients corrected for the artifacts of range
restriction and criterion unreliability. In these tables, the variation of coefficients is across the entire
job spectrum. Much of the variation in Table 37a is due to sampling error in the observed validity
coefficients. Overall, Table 37c shows that the three GATB composites are valid predictors of job
performance across all jobs, but the level of true validity varies across jobs.

Table 37a. Distribution of observed validity coefficients across all jobs.

                            Cognitive Ability   Perceptual Ability   Psychomotor Ability
Mean observed correlation          .25                  .25                  .25
Observed SD                        .15                  .15                  .17
Observed 10th percentile           .05                  .05                  .03
Observed 90th percentile           .45                  .45                  .47

Table 37b. Distribution of observed validity coefficients less variance due to sampling error.

                            Cognitive Ability   Perceptual Ability   Psychomotor Ability
Mean observed correlation          .25                  .25                  .25
SD corrected for sampling
error variance                     .08                  .07                  .11
Observed 10th percentile           .15                  .16                  .11
Observed 90th percentile           .35                  .34                  .39
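The correction in Table 37b removes the variance expected from sampling error alone. A minimal sketch of the Hunter-Schmidt "bare-bones" computation follows; the average per-study sample size is not reported in these tables, so the value of roughly 55 used here is our assumption (it approximately reproduces the tabled corrected SDs).

```python
import math

def residual_sd(mean_r, observed_sd, avg_n):
    """Hunter-Schmidt bare-bones meta-analysis: subtract the expected
    sampling-error variance from the observed variance of r."""
    var_e = (1 - mean_r**2) ** 2 / (avg_n - 1)   # expected sampling-error variance
    var_res = max(observed_sd**2 - var_e, 0.0)
    return math.sqrt(var_res)

# Cognitive ability composite from Tables 37a/37b;
# the average per-study N of ~55 is an assumption, not a reported value
sd = residual_sd(0.25, 0.15, 55)   # close to the .08 reported in Table 37b
```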

Table 37c. Distribution of true validity (corrected for range restriction and criterion unreliability) across
all jobs.

                            Cognitive Ability   Perceptual Ability   Psychomotor Ability
Mean corrected r                   .47                  .38                  .35
SD of corrected r's                .12                  .09                  .14
Observed 10th percentile           .31                  .26                  .17
Observed 90th percentile           .63                  .50                  .53
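The "true validity" values in Table 37c reflect corrections for direct range restriction and criterion unreliability. A sketch follows, using artifact values commonly cited for these data (a range restriction ratio u of about .67 and criterion reliability of about .60); both values are assumptions here, as the tables do not report them.

```python
import math

def true_validity(r_obs, u, ryy):
    """Correct an observed validity for direct range restriction
    (Thorndike Case II), then for criterion unreliability."""
    r_rr = (r_obs / u) / math.sqrt(1 - r_obs**2 + (r_obs / u) ** 2)
    return r_rr / math.sqrt(ryy)

# Cognitive ability composite: observed mean r = .25 (Table 37a);
# u = .67 and criterion reliability = .60 are assumed artifact values
rho = true_validity(0.25, 0.67, 0.60)   # close to the .47 in Table 37c
```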

Validity Generalization Study

This accumulation of studies supports two general conclusions: (a) the validity of reliable cognitive
ability tests does not vary much across settings or time, and (b) construct-valid cognitive ability tests
have some positive validity for all jobs (Hunter, 1983). Table 38 shows the average true validity for
training success and job proficiency for each of the three ability composites.

Table 39 shows that validity changes with the complexity of job content. As the complexity of a job
decreases (5 being lowest), the validity of cognitive ability also decreases, but not to zero; cognitive
validity is positively related to job complexity. At the same time, psychomotor ability has its highest
validity for proficiency where complexity is lowest. Table 39 also provides the results for training
success. Across all levels of job complexity, the validity of cognitive ability for predicting training
success is high. Overall, cognitive ability predicts job proficiency across all jobs, but its validity drops
off at low levels of job complexity.

Table 38. Average true validity.

Study Type         # of Jobs   Cognitive Ability   Perceptual Ability   Psychomotor Ability   Average
Training Success        90           .54                  .41                  .26              .40
Job Proficiency        425           .45                  .37                  .37              .40
Average                515           .47                  .38                  .35              .40

Table 39. Average true validity as a function of job complexity.


Complexity Job Proficiency Training Success
Level GVN SPQ KFM GVN SPQ KFM
1 .56 .52 .30 .65 .53 .09
2 .58 .35 .21 .50 .26 .13
3 .51 .40 .32 .57 .44 .31
4 .40 .35 .43 .54 .53 .40
5 .23 .24 .48 - - -
Average .45 .37 .37 .55 .41 .26
*GVN = Cognitive Ability, SPQ = Perceptual Ability, KFM = Psychomotor Ability

Table 40 shows that differential prediction is very effective when psychomotor ability is included in the
composite. With respect to job complexity, psychomotor ability is at its highest where the validity of
cognitive ability is at its lowest. Multiple regression equations produce multiple correlations that vary
from one complexity level to another and yield a higher overall level of validity than any single
predictor would. Using cognitive ability to predict proficiency at the three higher complexity levels and
psychomotor ability at the two lower levels raises the average validity. Thus, using psychomotor
ability in combination with cognitive ability increases validity as a function of job complexity.

Table 40. Validity of ability combinations for job proficiency: Best single predictor and two sets of
multiple regression weights with multiple correlations.

Complexity   Best Single      Beta Weights (3 predictors)      Beta Weights (2 predictors)
Level        Predictor       GVN    SPQ    KFM     R3          GVN    KFM     R2
1               .56          .40    .19    .07    .59          .52    .12    .57
2               .58          .75   -.26    .08    .60          .58    .01    .58
3               .51          .50   -.08    .18    .53          .45    .16    .53
4               .43          .35   -.10    .36    .51          .28    .33    .50
5               .48          .16   -.13    .49    .49          .07    .46    .49
Average         .48          .42   -.09    .27    .51          .37    .24    .52
*GVN = Cognitive Ability, SPQ = Perceptual Ability, KFM = Psychomotor Ability
**R3 = multiple correlation with all three ability composites; R2 = multiple correlation with GVN and
KFM
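The multiple correlations in Table 40 follow the standard regression identity R^2 = sum of beta_i x r_yi, where the beta weights and zero-order validities come from the same correlation matrix. A minimal sketch, using the complexity level 1 job-proficiency values from Tables 39 and 40:

```python
import math

def multiple_r(betas, validities):
    """Multiple correlation from standardized regression weights:
    R^2 = sum of beta_i * r_yi."""
    r2 = sum(b * r for b, r in zip(betas, validities))
    return math.sqrt(r2)

# Complexity level 1, job proficiency: GVN/KFM betas from Table 40,
# zero-order validities (.56, .30) from Table 39
R = multiple_r([0.52, 0.12], [0.56, 0.30])   # matches the R2 of .57 in Table 40
```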

Table 41. Validity of ability combinations for training success: Best single predictor and two sets of
multiple regression weights with multiple correlations.

Complexity   Best Single      Beta Weights (3 predictors)      Beta Weights (2 predictors)
Level        Predictor       GVN    SPQ    KFM     R3          GVN    KFM     R2
1               .65          .57    .21   -.21    .68          .70   -.16    .66
2               .50          .72   -.30    .03    .53          .52   -.05    .50
3               .57          .57   -.07    .15    .58          .53    .13    .59
4               .54          .34    .17    .20    .59          .46    .24    .59
5                -            -      -      -      -            -      -      -
Average         .55          .59   -.10    .11    .57          .53    .08    .57
*GVN = Cognitive Ability, SPQ = Perceptual Ability, KFM = Psychomotor Ability
**R3 = multiple correlation with all three ability composites; R2 = multiple correlation with GVN and
KFM

Overall, the results show that composites of GATB subtests are valid in predicting both job proficiency
and training success across all jobs. The level of GATB validity varies as a function of job complexity.
Even after jobs are stratified by complexity, there is some variation in validity. Some of this variation is
most likely due to artifacts in the study itself. However, other variance may be due to unknown job
dimensions other than the overall job complexity dimension.

Hartigan and Wigdor (1989) also computed validity generalization analyses for the GATB. They had
access to the full data set, including the many additional GATB validation studies conducted after
Hunter (1983) completed his analyses. Hartigan and Wigdor also found GATB validity to generalize
across all jobs, although the validities in the newer studies were smaller than those reported by
Hunter (1983). Several explanations have been offered for these differences, such as smaller sample
sizes, the use of different criteria, and differences in the jobs surveyed in the newer studies. Overall,
the newer studies provided sufficient validity generalization evidence to support use of the GATB
across all jobs in the U.S. Tables 42 and 43 below show the differences in validity across job families,
abilities, and criterion types, along with the differences between the Hunter (1983) and Hartigan and
Wigdor (1989) validity generalization results.

Table 42. Variation of validities across job families in 1983 (N = 515) and 1989 (N = 264) studies.

                                      GVN               SPQ               KFM
                                 Hunter   NRC      Hunter   NRC      Hunter   NRC
Job Family                       (1983)  (1989)    (1983)  (1989)    (1983)  (1989)
I (set-up/precision)               .34     .16       .35     .14       .19     .08
II (feeding/offbearing)            .13     .19       .15     .16       .35     .21
III (synthesizing)                 .30     .27       .21     .21       .13     .12
IV (analyze/compile/compute)       .28     .23       .27     .17       .24     .13
V (copy/compare)                   .22     .18       .24     .18       .30     .16
*GVN = Cognitive Ability, SPQ = Perceptual Ability, KFM = Psychomotor Ability

Table 43. Validities for the two sets of studies by job family and type of criterion.

GVN                               Performance                         Training
                             Hunter            NRC              Hunter           NRC
Job Family                   (1983)    N      (1989)     N      (1983)    N     (1989)    N
I (set-up/precision)           .31   1,142      .15    3,900      .41    180      .54     64
II (feeding/offbearing)        .14   1,155      .19      200       -      -        -       -
III (synthesizing)             .30   2,424      .25      630      .27  1,800      .30    347
IV (analyze/compile/compute)   .27  12,705      .21   19,206      .34  4,183      .36  3,169
V (copy/compare)               .20  13,367      .18   10,862      .36    655      .00    106

SPQ
I (set-up/precision)           .32              .13               .47             .40
II (feeding/offbearing)        .17              .16                -               -
III (synthesizing)             .22              .21               .18             .21
IV (analyze/compile/compute)   .25              .16               .29             .25
V (copy/compare)               .23              .18               .38             .01

KFM
I (set-up/precision)           .20              .07               .11             .16
II (feeding/offbearing)        .35              .21                -               -
III (synthesizing)             .17              .17               .11             .02
IV (analyze/compile/compute)   .21              .12               .20             .17
V (copy/compare)               .27              .16               .31             .12

Approach to Item / Test Security

The historical use of GATB for selection managed test security by requiring all administrations to be
proctored by trained and qualified test administrators. To our knowledge, no routine processes were
in place to monitor item characteristics or to produce alternative forms out of concern for item or test
security.

Translation / Adaptation

The GATB has been translated into 13 languages in 35 countries, including Arabic. Dagenais (1990)
studied an Arabic translation of the GATB administered in Saudi Arabia, contrasting samples of
participants from the U.S. and Saudi Arabia. Factor analyses found the same three-factor structure
(i.e., cognitive, spatial perception, and psychomotor) in both samples, thereby establishing factor
equivalence. The mean score profiles for the two groups were similar in shape and amplitude. Thus,
the overall findings supported the conclusion that the GATB could be translated into Arabic and
administered in Saudi Arabia. Table 44 shows the factor loadings for the GATB in the U.S. and Saudi
Arabia.
Table 44. Factor loadings for Saudi Arabian (SA) and American (US) GATB scores.

                         Factor 1         Factor 2         Factor 3
                         Cognitive        Psychomotor      Spatial Perception   Communalities
Subtest                  SA      US       SA      US       SA      US           SA      US
Computation              .80     .84      .08     .06      .10     .20          .65     .95
Name Comparison          .74     .87      .16     .23      .11     .02          .58     .80
Arithmetic Reasoning     .69     .76      .00    -.02      .33     .41          .57     .74
Vocabulary               .56     .79      .01    -.02      .51     .29          .57     .70
Object Matching          .55     .60      .27     .33      .42     .27          .55     .54
Mark Making              .47     .64      .39     .53     -.16    -.26          .36     .76
Man. Dex.: Turn          .07     .13      .77     .78      .10    -.10          .61     .63
Man. Dex.: Place         .20     .02      .73     .74      .07     .13          .58     .56
Fing. Dex.: Disassem.    .10     .16      .72     .74      .18     .10          .57     .59
Fing. Dex.: Assem.       .03     .06      .71     .67      .18     .23          .54     .51
Three-D Space            .10     .38      .16     .16      .80     .81          .70     .82
Form Matching            .21     .52      .22     .23      .68     .48          .58     .61

                              I                II               III                 Sum
Eigenvalues             4.14    5.17     1.72    1.99     1.00     .85         6.86    8.01
% of Total Variance      .35     .43      .14     .17      .08     .07
Cumulative % Tot. Var.   .35     .43      .49     .60      .57     .67
% of Common Var.         .38     .48      .36     .34      .26     .18
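Dagenais's exact method for establishing factor equivalence is not detailed here; a standard check is Tucker's coefficient of congruence between corresponding loading vectors, sketched below for Factor 1 using the Table 44 loadings.

```python
import math

def congruence(x, y):
    """Tucker's coefficient of congruence between two factor-loading vectors."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

# Factor 1 (cognitive) loadings for the 12 subtests, from Table 44
sa = [0.80, 0.74, 0.69, 0.56, 0.55, 0.47, 0.07, 0.20, 0.10, 0.03, 0.10, 0.21]
us = [0.84, 0.87, 0.76, 0.79, 0.60, 0.64, 0.13, 0.02, 0.16, 0.06, 0.38, 0.52]
c = congruence(sa, us)   # about .97; values above ~.95 are read as factor equivalence
```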

User Support Resources

Note: The GATB is out of print in the U.S., but it is still used in Canada (i.e., Forms A and B) for
vocational counseling, rehabilitation, and occupational selection settings. Nelson in Canada can
supply such resources as:

 Administration and scoring manuals
 Application and interpretation manuals
 Test booklets
 Answer sheets
 Dexterity boards
 Scoring masks
 Recording sheets
 Examination kits
 Score conversion/reporting software
 Interpretation aid charts

Evaluative Reviews

Not only is the GATB the most extensively researched multiaptitude cognitive ability battery, it is also
the only cognitive ability battery linked to more than 800 occupations in the U.S. O*NET system.
Overall, GATB studies have demonstrated that (a) GATB composites have high levels of reliability,
(b) validity generalization studies show it to be a valid predictor of job performance for all jobs in the
U.S. economy, and (c) Forms E and F were developed to minimize race/ethnic- and gender-related
sources of bias (Keesling, 1985). However, two limitations are reported: (a) because of the speeded
subtests, individuals with disabilities may be penalized, and (b) the norms for the GATB are dated.

INFORMATION ABOUT THE US OFFICE OF PERSONNEL MANAGEMENT’S NEW CIVIL
SERVICE TEST SYSTEM, USA HIRE

In the course of investigating the seven selected batteries, the new civil service testing system
recently developed by the US Office of Personnel Management was identified as a useful
comparison. This new system includes at least four newly developed cognitive ability tests. (An
additional assessment, the Interaction test, is a personality measure.) Unfortunately, at the time of
this Report, OPM was making no information about these tests available beyond what was already
accessible online at https://www.usajobsassess.gov/assess/default/sample/Sample.action. This site
provides a sample item for each new test. Screenshots of these sample items are reproduced below
for reference.

Occupational Math Assessment


In this assessment, you will be presented with multiple-choice questions that measure
your arithmetic and mathematical reasoning skills. You will be asked to solve word
problems and perform numerical calculations. You will also be asked to work with
percentages, fractions, decimals, proportions, basic algebra, basic geometry, and basic
probability. All of the information you need to answer the questions is provided in the
question text. Knowledge of Federal rules, regulations, or policies is NOT required to
answer the questions.

You MAY use a calculator and scratch paper to answer the questions.
Read the questions carefully and choose the best answer for each question. Once you
have selected your response, click on the RECORD ANSWER button. You will not be
able to review/change your answers once you have submitted them. This
assessment contains several questions. For each question, you will have 5 minutes
to select your
answer. A sample question is shown below.

Occupational Math Example

Solve for x.

3x - 3 = 6

12

Occupational Judgment Assessment


In this assessment, you will be presented with a series of videos that are scenarios
Federal employees could encounter on the job. For each scenario, you will be presented
with four possible courses of action for responding to the scenario presented in the
video. You will be asked to choose the response that is the MOST effective course of
action and the response that is the LEAST effective course of action for that particular
scenario.

When you begin each scenario, the upper left half of the screen contains important
information for you to review prior to watching the video. Below this information you will
find the question. This question tells you from which perspective you should
respond to the scenario. The video appears in the upper right. The bottom half of the
screen contains the four courses of action for responding to the scenario.
The video can be viewed in closed captioning by clicking on the CLOSED CAPTIONING
button located to the left of the VOLUME button.
After watching the video and reading through the four courses of action, click on the
button under Most Effective to choose the course of action you consider the best for that
situation. Do the same for the course of action you consider the worst by clicking on the
button under Least Effective.
This assessment contains several scenarios. For each scenario, you will have 5
minutes to select both the MOST and LEAST effective courses of action.
A sample question is shown below. To view the sample video, click the PLAY button in
the lower corner of the video box. To view the video in closed captioning, click on the
CLOSED CAPTIONING button located to the left of the VOLUME button.

Occupational Judgment Example

Watch the following video. Choose the most AND least effective course of action from the options
below.

Step 1. Scenario
Barbara and Derek are coworkers. Barbara has just been
provided with a new assignment. The assignment requires the
use of a specific computer program. Derek walks over to
Barbara's cubicle to speak to her.
If you were in Barbara's position, what would be the most and
least effective course of action to take from the choices below?

Step 2. Courses of Action Most Effective Least Effective


Try to find other coworkers who can explain how to use the
new program.
Tell your supervisor that you don't know how to use the
program and ask him to assign someone who does.
Use the program reference materials, tutorial program, and
the help menu to learn how to use the new program on your
own.
Explain the situation to your supervisor and ask him what to
do.

Occupational Interaction Assessment


In this assessment, you will be presented with questions that ask about your interests
and work preferences. Read each question carefully, decide which of the five possible
responses most accurately describes you, and then click on that response. Once you
have clicked on that response, click on the RECORD ANSWER button.
The possible responses vary from question to question. The responses will assess:

1) How much you agree with a statement;

2) How often you do, think, or feel things; or

3) How much or how often you do, think, or feel things compared to others.
This assessment is not timed.
A sample question is shown below.

Occupational Interaction Example


If I forget to fill out a form with information that others need, I make sure to follow up.

Almost always | Often | Sometimes | Rarely | Never

Occupational Reasoning Assessment


In this assessment, you will be presented with multiple-choice questions that measure your reasoning skills. You will
be asked to draw logical conclusions based on the information provided, analyze scenarios, and evaluate arguments.

All of the information you need to answer the questions is provided in the question text. Knowledge of Federal rules,
regulations, or policies is NOT required to answer the questions.

Read the questions carefully and choose the best answer for each question. Once you have selected your response
click on the RECORD ANSWER button.

This assessment contains several questions. For each question, you will have 5 minutes to select your answer

A sample question is shown below.

Occupational Reasoning Example


All documents that contain sensitive information are considered to be classified. Mary has drafted a
report that contains sensitive information.

Based on the above statements, is the following conclusion true, false, or is there insufficient
information to determine?
The report drafted by Mary is considered to be classified.
True

False

Insufficient Information

Occupational Reading Assessment


In this assessment, you will be presented with multiple-choice questions that measure your reading
comprehension skills. You will be asked to read a passage or table and to answer questions based on the
information provided in the passage or table. All of the information you need to answer the questions is
provided in the question text. Knowledge of Federal rules, regulations, or policies is NOT required to answer the
questions.

Read the questions carefully and choose the best answer for each question. Once you have selected your response
click on the RECORD ANSWER button.

This assessment contains several questions. For each question, you will have 5 minutes to select your answer

Occupational Reading Example


The job grade definitions establish distinct lines of demarcation among the different levels of work
within an occupation. The definitions do not try to describe every work assignment of each position
level in the occupation. Rather, based on fact finding and study of selected work situations, the
definitions identify and describe those key characteristics of occupations which are significant for
distinguishing different levels of work.

In the context of the passage, which one of the following could best be substituted for distinct lines
of demarcation without significantly changing the author's meaning?

agreement

comparisons

connections

boundaries

DESCRIPTIVE RESULTS FOR PET
Overall Description and Uses

Introduction and Purpose

PSI’s Professional Employment Test (PET) battery consists of four subtests that measure three major
cognitive abilities.

The PET was developed in 1986 and has been used extensively for selection purposes with
occupations requiring a high level of education. The battery was developed to assess specific
abilities that have been found to be important in performing entry-level professional, administrative,
managerial, and supervisory jobs. The PET was initially developed for use with state and local
government jobs; PSI subsequently expanded its use to the same range of jobs within the private
sector. These abilities are well established in the personnel selection validity literature, which
demonstrates their importance for performance in professional and managerial occupations.

The Data Interpretation subtest was designed to measure quantitative problem solving ability and
reasoning ability. Reasoning measures verbal comprehension ability and reasoning ability in the form
of syllogisms. The Quantitative Problem Solving subtest was designed to measure quantitative
problem solving ability. Reading Comprehension was designed to measure verbal comprehension
ability.

Purpose of Use

The Professional Employment Test (PET) was designed for use as a selection instrument appropriate
for applicants applying for jobs that usually require a college education. It is better suited for selection
in higher-level occupations than many other tests of GMA because of its complex, work-related
content and high reading level. The PET was also designed to be easy to administer, score, and
report.

Targeted Populations

The PET targets working adults who are candidates for entry-level professional, administrative,
managerial, and supervisory occupations, as potential professional employees in both public and
private sector jobs. These occupations usually require a college-level education. To support this
intended use, all items were written at a reading difficulty level of no greater than the fourth year of
college (grade 16), with item content representative of content commonly encountered in business
and government work settings.

Target Jobs/Occupations

Normative information for the PET is presented in the technical manual for three major occupation
groups: (a) Facilitative, (b) Research and Investigative, and (c) Technical and Administrative.

A. Facilitative
a. Jobs such as Buyer, Eligibility Worker, Human Service Worker, and Rehabilitation
Counselor that determine the need for services or supplies and identify and apply the
appropriate rules and procedures to see that needs are met.
i. An example of a job task is obtaining necessary or required
verification/authentication from records and from various sources.

B. Research and Investigative
a. Jobs such as Research Statistician, Probation and Parole Specialist, Staff
Development Specialist, Unemployment Insurance Claims Deputy, and Equal
Employment Opportunity Coordinator that provide information to decision-makers
through investigation and research methods.
i. An example of a job task is procuring required documentation and
information for new investigations.

C. Technical and Administrative


a. Jobs such as Computer Programmer/Analyst, Prisoner Classification Officer, and
Support Enforcement Officer that organize information and resolve problems through
specific processes and methodologies.
i. An example of a job task is coding new programs within desired time frames.

Spread of Uses

The PET was designed for use as a selection instrument. PSI has not supported other uses, such as
development or career guidance.

Administrative Details

Table 45 describes the administrative features of the PET subtests.

Table 45. Administrative detail for the PET.

Subtest                        # Items   Time         Scoring Rule             Delivery Methods
Data Interpretation              10      20 minutes   An overall raw score     Paper-pencil; proctored.
Reasoning                        10      20 minutes   is generated with a      Online (ATLAS™);
Quantitative Problem Solving     10      20 minutes   percentile score.        proctored and unproctored.
Reading Comprehension            10      20 minutes

Number and Types of Items

The PET has two alternate forms, the PET-A and the PET-B, which have the same length. Both
forms have four subtests (i.e., data interpretation, reasoning, quantitative problem solving, and
reading comprehension) with 10 items each, for a total of 40 items. The short-form PET is half the
length of the long forms, containing 20 items, five for each of the four subtests. Each subtest contains
multiple-choice questions that measure the three abilities described above. The item types were
based on work by French, Ekstrom, and Price (1963), which established various abilities in the
psychometric literature, such as verbal comprehension, quantitative problem solving, and reasoning.

Time Limits

The paper-and-pencil version of the PET has an 80-minute time limit, not including time for reviewing
instructions. Although each of the four subtests carries a nominal time limit, those limits are arbitrary;
because of the omnibus format, the only limit actually enforced is the overall 80 minutes for the four
subtests combined. The short form of the PET contains 20 items with a 40-minute time limit.

If using PSI’s proprietary online administration platform, ATLAS™, the testing time is extended by 5
minutes. This is done so that the PET remains a power (non-speeded) test.

Type of Score

The PET can be scored three ways: (a) computer-automated scoring (using PSI’s ATLAS™
platform), (b) hand scoring using templates, and (c) on-site optical scanning/scoring. The PET raw
score is the number of items answered correctly for each subtest. Each raw score determines a
percentile score based on relevant PSI norm groups. Scores can also be banded, treated as
pass/fail, or used to rank examinees.

Test Scale and Methods of Scaling

Raw scores are number-correct scores for each subtest and are transformed into percentile scores
for reporting and interpretation, based on the distributions of PET raw scores in relevant norm groups
defined by the three occupation groups (Facilitative, Research and Investigative, and Technical and
Administrative). Items within each subtest are ordered by difficulty, such that the first item is easier
than the last.
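PSI's exact percentile convention is not documented here; one common convention is the percent of the norm group scoring at or below a given raw score, sketched below with hypothetical norm data.

```python
import bisect

def percentile_score(raw, norm_raw_scores):
    """Percentile rank of a raw score within a norm group:
    percent of norm-group scores at or below the raw score."""
    sorted_norms = sorted(norm_raw_scores)
    at_or_below = bisect.bisect_right(sorted_norms, raw)
    return round(100 * at_or_below / len(sorted_norms))

# Hypothetical norm group of subtest raw scores (0-10 items correct)
norms = [3, 4, 4, 5, 5, 5, 6, 6, 7, 8]
percentile_score(6, norms)   # -> 80
```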

Method of Delivery

The PET is available in both paper-pencil administration and online administration. For online use,
the PET is administered through PSI’s own web-delivery platform, ATLAS™. The ATLAS™ web-
based talent assessment management system is an enterprise web-based Software-as-a-Service
(SaaS) solution for managing selection and assessment processes. The platform supports a range of
functions (e.g., configuring tests, batteries, and score reports; managing test inventory; managing and
deploying proctored and unproctored tests; and establishing candidate workflows with or without
applicant tracking systems). One important aspect of administering the PET through ATLAS™ is that
it can be proctored online. Unproctored testing is unavailable for the PET in the U.S., but PSI has
developed an alternate form specifically for unproctored online use internationally.

Costs

Test Materials Cost


Tests (pkg. 25; scoring sheets included)
Long Form $445.00
Short Form $445.00
Manuals
PET Technical Manual $350.00
*Other cost information was not available despite extensive efforts to obtain it.

Construct-Content Information

Intended Constructs

The types of subtests included in the PET are typical of subtests in other commercially available tests
of cognitive ability. The PET is comprised of four subtests designed to measure three abilities. These
subtests consist of Quantitative Problem Solving, Reading Comprehension, Reasoning, and Data
Interpretation, which assess three major cognitive abilities (i.e., verbal comprehension, quantitative
problem solving, and reasoning). The subtests are intended to measure abilities similar to those
measured by subtests in other cognitive ability batteries, but the items are written at a higher reading
level and with more complex content than most, with the exception of the Watson-Glaser and Verify
subtests.
PSI designed the PET to be more appropriate than other batteries for use in professional,
administrative, and managerial jobs in state government. PSI conducted intensive interviews and
nationwide surveys about testing practices and searched the literature for tests that might already
have been appropriate. The results were inconclusive and spurred PSI to develop its own test, one
specifically relevant to professional/managerial occupations within state government organizations.
The types of subtests comprising the PET have been well established in psychometric literature (e.g.,
see French, 1951; Guilford, 1956; Nunnally, 1978) and were selected for use in this battery in view of
substantial empirical research. These abilities have also been found to be important in professional
occupations (e.g., see McKillip, Trattner, Corts, & Wing, 1977; Hunter, 1980).

The abilities and their respective definitions are listed as follows:

Verbal Comprehension is described by PSI as the ability to understand and interpret information that
is complex and imparted through language. This also includes the ability to read and understand
written materials.

Quantitative Problem Solving is described by PSI as the ability to apply reasoning processes for
solving mathematical problems. This may often be seen as a special form of general reasoning.

Reasoning is described by PSI as the ability to analyze and evaluate information for making correct
conclusions. Reasoning also includes inductive and deductive logic.

Table 46 shows the abilities assessed by the PET.

Table 46. PET subtests and the abilities the battery assesses.
Ability
Verbal Quantitative Problem
Subtest Comprehension Solving Reasoning
Data Interpretation
X X
Reasoning
X X
Quantitative Problem
X
Solving
Reading
X X
Comprehension
Item Content

The PET utilizes four subtests and is intended to measure three abilities that have been deemed
important in professional, administrative, and managerial occupations. PET item content was
developed to be work-like, meaning that the items in the four subtests reflect content that might
similarly be observed on the job in such occupations. All subtest item content is verbal, and all items
were written at a reading difficulty level of no greater than the fourth year of college (grade 16), as
measured by the SMOG index of readability (McLaughlin, 1969). Test items are based on language
and situations that are commonly encountered in business and government work settings.

Specific Item Content Examples

The PET is comprised of four subtests:


A. Data Interpretation
a. Measures quantitative problem solving ability and reasoning ability.
i. These items consist of numerical tables, where selected entries from each
table have been deleted. The examinee deduces the value of the missing
entries using simple arithmetic operations. The examinee then selects the
correct answer from five alternatives.
ii. Sample item unavailable

B. Reasoning
a. Used in measuring verbal comprehension ability and reasoning ability; items are in
the form of syllogisms.
i. The examinee is given a set of premises that are accepted as true and a
conclusion statement. The examinee determines whether the conclusion is:
1) necessarily true; 2) probably, but not necessarily, true; 3) indeterminable;
4) probably, but not necessarily, false; or 5) necessarily false. The premises
of the questions are hypothetical situations related to work in business,
industry, and government.

C. Quantitative Problem Solving


a. Used in measuring quantitative problem solving ability.
i. These items consist of word problems. The examinee identifies and applies
appropriate mathematical procedures for solving the problem and selects the
correct answer from five alternatives. The questions reflect the kinds of
mathematical problems that might occur in business, industry, or
government.

D. Reading Comprehension
a. Used in measuring verbal comprehension ability and reasoning ability.
i. The examinee reads a paragraph and a multiple-choice question and selects
the one of five alternatives that best reflects the concept(s) of the paragraph.
The correct answer may be a restatement of the concept(s) or a simple
inference. Reading passages were based on publications and materials
dealing with common issues in society.

Construct Validity Evidence

Construct Validity

No construct validity evidence was reported in the technical manual or other available resources.
There are no studies or other information reported in technical manuals or published sources that
compare the subtests of the PET cognitive ability battery with other similar cognitive ability batteries.
The publisher confirmed this as well.

Item and Test Development

Item Development

Two hundred items were initially developed by five members of the PSI professional staff. The initial
version of the test consisted of 54 inference, 48 quantitative reasoning, 47 reading comprehension,
and 49 tabular completion (arithmetic such as addition and subtraction) items. The reading level was
no higher than grade 16. A subcommittee of item writers reviewed and approved the items. This
review was designed to ensure that items did not contain material that may be offensive or unequally
familiar to members of any racial, ethnic, or gender group.

The four subtest items were based on the work of French, Ekstrom, and Price (1963) which
established various abilities in psychometric literature such as verbal comprehension, quantitative
problem solving, and reasoning. PSI also relied upon validity evidence from research that evaluated
these item types as predictors of professional job performance (O'Leary, 1977; O'Leary & Trattner, 1977;
Trattner, Corts, van Rijn, & Outerbridge, 1977; Corts, Muldrow, & Outerbridge, 1977). After
development of the items, they were pretested and subjected to statistical analyses to cast test forms
to meet psychometric specifications. The test was divided into two booklets, each containing two item
types (i.e., Booklet 1: Data Interpretation and Reading Comprehension; Booklet 2: Reasoning and
Quantitative Problem Solving). The two forms were then administered to state employees in a variety
of professional occupations according to standard instructions.

An item analysis was then conducted. Examinees who failed to answer the last five items of a test or
two-thirds of the total number of items on a test were excluded from the item analysis samples. Also,
items that were not attempted were scored as incorrect.

Data from the pretest were used to compute classical test theory (CTT) item statistics:

Item Difficulty

The target item difficulty was set at .60. Items that were very difficult (fewer than 20% of
examinees answered correctly) and items that were very easy (more than 90% of examinees answered
correctly) were removed from the PET.

Item Discrimination

Item-total point-biserial correlations were used to determine item discrimination. Items that had the
highest discrimination values were selected.

Distractor Effectiveness

Items with ineffective distractors were avoided (i.e., items for which a distractor had a positive
distractor-total correlation or for which a distractor was never selected by examinees).

Item Bias

The Delta-plot method (Angoff, 1982) was used to evaluate the extent to which items may have been
biased against Black test takers. Items that were found to be disproportionately difficult for Black
examinees were removed from the test.

Assessing Prediction Bias

PSI conducted a study of prediction bias on the PET using White and Black job performance in the
Facilitative job category. No other occupations or groups were studied due to small sample sizes and
insufficient statistical power to detect an effect. The Gulliksen and Wilks (1950) procedure was used
to evaluate differences between Black and White regression lines. Although intercepts were found to
vary between Whites and Blacks, these differences resulted in over-prediction of Black job
performance, a common result with cognitive predictors. Based on this result, PSI concluded that the
PET was not biased against Black applicants. Table 47 provides the results from the study, which
demonstrated a lack of prediction bias against Black applicants.

Table 47. Descriptive Statistics for PET Scores for Blacks and Whites

                        Blacks (N = 99)                       Whites (N = 115)
                        M      SD    r PET-A   r PET-B        M      SD    r PET-A   r PET-B
PET-A                   16.9   5.6                            22.9   6.5
PET-B                   17.3   4.8                            23.7   5.6
Supervisory Rating      3.9    .6    .22       .37            4.0    .7    .36       .23
Work Sample             15.2   2.5   .42       .23            16.9   2.8   .28       .27
Job Knowledge Test      25.5   4.6   .36       .39            29.5   4.4   .55       .47

Psychometric Model

There is no indication that PSI applied IRT-based item analyses to PET items during the
development process. However, the interview with PSI's industrial-organizational psychologist who
supports the PET indicated that PSI does estimate IRT parameters for PET items as the sample size
of PET test takers has increased.

Multiple Forms

To create two equivalent forms, PSI assigned items thought to be closely matched in difficulty and
discrimination to each of the two forms. PSI then computed the reliability of each form using the
Kuder-Richardson (KR-20) formula. Results showed that the two forms, PET-A and PET-B, were
equivalent in difficulty, variance, and reliability. The correlation between scores on the two forms
was .85 (n = 582); when corrected for attenuation, the correlation is .99, indicating that the two
forms are alternate forms of the same test. Table 48 reports the results of the equivalency study.

Table 48. Reliability of PET-A and PET-B.

        Number of            Standard                   Standard Error
Form    Examinees    Mean    Deviation    Reliability   of the Estimate
A       582          23.0    7.3          .86           2.94
B       582          23.2    7.2          .86           2.95
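The disattenuated value reported above follows from Spearman's correction for attenuation; a minimal sketch reproducing the arithmetic with the figures from this section (.85 observed, KR-20 of .86 for each form):

```python
# Reproducing the alternate-forms arithmetic reported above: observed
# PET-A/PET-B correlation .85, KR-20 reliability .86 for each form.
def correct_for_attenuation(r_ab, rel_a, rel_b):
    """Spearman's correction: estimated correlation between true scores."""
    return r_ab / (rel_a * rel_b) ** 0.5

print(round(correct_for_attenuation(0.85, 0.86, 0.86), 2))  # 0.99
```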

Item Banking Approaches

As would be expected from having only two fixed forms of each subtest, PET items do not appear to
be managed as a bank of items although the PSI I-O psychologist reported that PSI’s overall strategy
is to accumulate items in an increasingly large bank as new items are developed for new forms. It is
not known whether new forms of PET have been developed.

Approach to Item / Test Security

Test Security

For all four of the PET subtests there are two forms, Form A and Form B. Also, for online proctored
administration in the U.S., PSI's ATLAS™ web-based talent assessment management system is
available as an enterprise Software-as-a-Service (SaaS) solution for managing selection and
assessment processes. This platform can function in a number of capacities (e.g., configuring tests,
batteries, and score reports; managing test inventory; managing and deploying proctored and
unproctored tests; and establishing candidate workflows with or without applicant tracking systems).
According to the publisher (PSI), the ATLAS™ platform has the ability to present test items in a
randomized order. PSI offers a specific form intended for online unproctored use internationally.
Within the U.S., however, the publisher forbids unproctored testing and requires clients to agree that
the PET will be proctored and overseen by a test administrator to reduce cheating and dissemination
of test items and materials. Another security step taken by the publisher is password protection for
test takers.

Criterion Validity Evidence

Criterion Validity

Criterion validity studies were conducted for PET in the three target job families. The observed
validity coefficients were corrected for criterion unreliability and range restriction. The reliability
estimates for the supervisory ratings and work samples were not available, so PSI used the work of
Pearlman, Schmidt, and Hunter (1980) to estimate the reliability to be .60. The KR-20 reliability of the
job knowledge test was estimated to be .69.
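The criterion-unreliability part of the correction described above can be sketched as follows. This illustrates only the reliability correction (the range-restriction correction requires the restricted and unrestricted predictor SDs, which are not reproduced here), and the example values are hypothetical.

```python
# Sketch of the criterion-unreliability correction described above. PSI
# also corrected for range restriction, which needs additional data and
# is omitted here; the example values are hypothetical.
def correct_for_criterion_unreliability(r_xy, criterion_reliability):
    """Estimated validity against a perfectly reliable criterion."""
    return r_xy / criterion_reliability ** 0.5

# e.g., an observed validity of .31 against a criterion with the assumed
# reliability of .60:
print(round(correct_for_criterion_unreliability(0.31, 0.60), 2))  # 0.4
```

Because this sketch omits the range-restriction step, it will not reproduce the fully corrected coefficients in Table 49.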

The validity coefficients in Table 49 provide significant support for the validity of both forms of the
test, PET-A and PET-B. All correlations between the PET scores and the criterion measures are
significant and positive, with corrected validity coefficients ranging from .25 to .84. Although not
reported in the table, the weighted averages of corrected coefficients for PET-A and PET-B were .54
and .48, respectively. Overall, the average was .51, which is congruent with validity evidence for
other tests of cognitive ability used in selection (Schmidt & Hunter, 1998).

Table 47 above also provided predictive validity evidence for the norm groups of Whites and Blacks.
Because the PET over-predicted Black performance relative to Whites, it does not create a
disadvantage for Black applicants.

Table 49. Validity Coefficients for PET – Corrected (Uncorrected)

                                                     Job Family
                             Facilitative            Research & Investigative    Technical & Administrative
Criterion                    PET-A      PET-B        PET-A      PET-B            PET-A      PET-B
Supervisory Ratings          .54 (.31)  .53 (.29)    .40 (.27)  .32 (.22)        .35 (.24)  .25 (.17)
  (N)                        (223)                   (181)                       (132)
Work Sample                  .61 (.42)  .54 (.37)    .44 (.30)  .34 (.23)        .84 (.59)  .83 (.58)
  (N)                        (224)                   (196)                       (137)
Job Knowledge Test           .75 (.56)  .74 (.55)
  (N)                        (224)
Standardized Two-Criterion
Composite                    (.48)      (.44)        (.38)      (.31)            (.49)      (.42)
  (N)                        (220)                   (178)                       (123)

The technical manual for the PET also reports that there may be strong transport validity evidence for
the battery across other similar jobs. PSI notes that US federal guidelines describe three requirements
for transportability of validity results in other jobs: (a) there must be a criterion-related validity study
meeting the standards of Section 14B; (b) there must be a test fairness study for relevant race, sex,
and ethnic groups, if technically feasible; and (c) the incumbents of the job to which validity is to be
transported must perform substantially the same major work behaviors as the incumbents of the job in
which the criterion-related study was conducted. The technical manual reports that the first two
requirements have already been met by the publisher, but the test user must establish job similarities.
In order for the test user to establish the similarity of a job to any of the jobs in the criterion-related
studies, the major work behaviors comprising that job must be identified. Resulting from such a
comparison, it can then be determined whether the criterion validity evidence can actually be
transported to other jobs.

Translations / Adaptations

No translations exist for the PET, which is available only in English.

User Support Resources

 Technical manual
 Fact sheet
 Note: PSI does not publicly display user support resources.

Evaluative Reviews

The Twelfth Mental Measurements Yearbook

The reviewers of the PET provide positive feedback regarding the reliability and validity evidence,
as well as the test construction and supporting documents (Cizek & Stake, 1995). A few of the
strengths highlighted are the validity evidence, bias prevention procedures, technical documentation,
instructions for test administrators, and also its ability to be administered quickly and scored easily. It
is a selection instrument that should be useful to government organizations that are selecting for
professional, managerial, and administrative positions.

However, there are also cautions of the PET. The reviewers point out that no construct validity or
test-retest indices are reported. Also, the validation sample encompasses only three occupation
groups. The reviewers also raised questions about PET’s potential for bias against certain applicant
groups.

DESCRIPTIVE RESULTS FOR PPM


Overall Description and Uses

Introduction

Hogrefe’s Power and Performance Measures (PPM) battery consists of nine subtests that may be
administered and scored in any combination appropriate to the interests of the organization.
Originally developed by J. Barrett in the late 1980’s, PPM is now supported and marketed by Hogrefe.
These subtests are organized into four groups based on two characteristics. The first characteristic
distinguishes between verbal and nonverbal content. The second characteristic distinguishes
between power and performance content. This classification of subtests is as follows.

              Verbal                    Nonverbal
Power         Verbal Reasoning         Perceptual Reasoning
              Numerical Reasoning      Applied Power
Performance   Verbal Comprehension     Spatial Ability
              Numerical Computation    Mechanical Understanding
                                       Processing Speed

Developed in the early 1990’s, the PPM battery was intended to represent a battery of cognitive ability
subtests measuring a wide range of specific abilities and aptitudes. In addition to the common verbal
v. nonverbal distinction, the unique feature of the PPM subtests is that they are organized around the
distinction between power and performance measures. The PPM Technical Manual describes power
subtests as designed to measure aptitude to learn, or reasoning, where the measure of the target
aptitude has little dependence on previous learning or experience. Performance subtests are
designed to measure ability to perform specific types of tasks, where the measure of the target
performance ability depends somewhat more on previous learning and experience, although not so
much as to be a test of knowledge.

It should be noted that PPM’s power measures are very similar to measures of “fluid intelligence”
described by the Cattell-Horn-Carroll (CHC) factor model of intelligence and the PPM performance
measures are somewhat more similar to measures of “crystallized intelligence”. In addition, it should
be noted that the PPM Technical Manual’s usage of the term “ability” to refer “what an individual can
do now and may be able to do after experience and training” is not a widely accepted use of the term
ability. More often, the terms aptitude and ability are interpreted as synonyms, both of which refer to
a general ability to learn and perform. For example, the expression “general mental ability” is
commonly used to refer to the general capacity to learn. Finally, it should be noted that the PPM
Spatial Ability subtest measures an attribute that is more often considered a facet of “fluid
intelligence”, particularly when measured using symbolic, nonverbal items as is the case in PPM’s
Spatial Ability subtest.

Purpose

PPM was introduced to provide a battery of subtests that could be used in various combinations to
inform personnel selection decisions and career counseling / skill development planning. PPM was
intended to be used by organizations and individuals to inform personnel decisions. Notably,
however, no item content in any subtest is specific to work activity. All subtest content is work
neutral. PPM was intended to be easy to use and score, thus facilitating its use in organizations.

Target Populations

The PPM subtests are designed to be appropriate for working adults ranging from 18 to 60 years.
Although no target reading level is provided, normative data has been gathered from three samples of
British adults, including 1,600 members of the general population, 337 professional managers, and
various samples of working populations in a variety of job categories.

Target Jobs / Occupations

At the time the PPM battery was being developed, research evidence had largely confirmed that
cognitive ability measures were highly valid predictors of job performance across the full range of
occupations in the workforce (see Schmidt & Hunter, 1998). The PPM battery was designed to be
relevant to virtually all occupations requiring at least a minimal level of information-processing
complexity.
The authors anticipated that tailored combinations of the PPM subtests could be assembled based on
job information that would optimize the relevance of the PPM scores to performance in that target job.
As examples, they described likely tailored combinations for the following job groups.

Engineering / Technical

 Numerical Computation
 Spatial Ability
 Mechanical Understanding

Technological / Systems (IT)

 Applied Power
 Numerical Reasoning
 Perceptual Reasoning

High Level Reasoning

 Applied Power
 Verbal Reasoning
 Perceptual Reasoning
 Numerical Reasoning

General Clerical

 Verbal Comprehension
 Numerical Computation
 Processing Speed

Sales and Marketing

 Verbal Reasoning
 Verbal Comprehension
 Numerical Reasoning

The general applicability of PPM subtests across a wide range of occupations is facilitated by the
decision to only use work-neutral content in all PPM subtests.

Spread of Uses

Hogrefe describes two primary applications for PPM within organizations – personnel selection and
career counseling. Both applications are supported by a variety of norm tables that provide the PPM
subtest score distributions for various groups of individuals, organized around job types or disciplines.
The technical manual describes the following 10 disciplines.

 Science
 “Professional” Engineering
 Arts/Humanities
 Numeracy
 Engineering / Technology
 Clerical
 Business / Accountancy
 Literary
 Social Science
 Craft

For each profile, the technical manual provides the average subtest scores achieved by incumbents
on three different scales – subtest raw score, percentile score within the British working population
and a so-called “IQ” score within the British working population. The “IQ” score is a transformation of
the raw score scale on which the average score is 100 and the standard deviation is 10. A percentile
and IQ scale score based on the British working population is available for each subtest raw score
and for certain combinations of subtests such as verbal and nonverbal.
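The manual's norm-based conversions can be sketched as below. The linear "IQ" transform follows the manual's description (mean 100, SD 10); the percentile function assumes approximately normal norm-group scores, which is our simplifying assumption rather than Hogrefe's documented method, and the numbers are hypothetical placeholders rather than Hogrefe norms.

```python
# Sketch of the norm-based conversions described above. The "IQ" scale
# follows the manual (mean 100, SD 10); the percentile assumes normal
# norm-group scores (our assumption). Numbers are hypothetical.
from statistics import NormalDist

def to_iq_scale(raw, norm_mean, norm_sd):
    """Linear transform of a raw score to mean 100, SD 10."""
    return 100 + 10 * (raw - norm_mean) / norm_sd

def to_percentile(raw, norm_mean, norm_sd):
    """Approximate percentile rank within the norm group."""
    return 100 * NormalDist(norm_mean, norm_sd).cdf(raw)

print(to_iq_scale(30, 25, 5))           # one SD above the mean -> 110.0
print(round(to_percentile(30, 25, 5)))  # -> 84
```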

Hogrefe does not appear to recommend cut scores for selection but advises user organizations to
establish their own cut scores based on Hogrefe-supplied norm tables for common job groups
(disciplines).

Hogrefe describes an approach to career counseling that relies on the similarity of an individual’s
PPM score profile to the score profile for the average incumbent in each of several relevant job
groups (disciplines). In general, the more similar one’s PPM score profile is to a job group profile, the
stronger the recommendation that the individual would be a good fit with jobs within that group.
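One simple way to operationalize this profile-matching idea is Euclidean distance between score profiles. Hogrefe's actual similarity index is not documented in the sources reviewed, so the function, discipline names, and T-score profiles below are purely illustrative.

```python
# Illustrative profile matching via Euclidean distance; Hogrefe's actual
# similarity index is not documented, so all names and data below are
# hypothetical.
def profile_distance(person, group):
    """Smaller distance = more similar profile = stronger suggested fit."""
    return sum((p - g) ** 2 for p, g in zip(person, group)) ** 0.5

candidate = [55, 48, 60]  # hypothetical T-scores on three subtests
disciplines = {"Clerical": [50, 52, 49], "Science": [56, 47, 61]}
best_fit = min(disciplines,
               key=lambda d: profile_distance(candidate, disciplines[d]))
print(best_fit)  # -> Science
```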

Administrative Details

Administrative detail is summarized in Table 50 and briefly described below.

Table 50. Administrative features of the PPM subtests.

Subtest                     # Items                     Time Limit                  Scoring Rule (Raw Score)
Applied Power               25                          12 minutes                  # Correct (both answers must be correct)
Mechanical Understanding    32                          8 minutes                   # Correct – 1/4 # wrong
Numerical Computation       40 (two 20-item sections)   6 minutes (3 per section)   # Correct – 1/3 # wrong
Numerical Reasoning         25                          10 minutes                  # Correct – 1/4 # wrong
Perceptual Reasoning        26                          6 minutes                   # Correct – 1/3 # wrong
Processing Speed            50                          3 minutes                   # Correct – 1/3 # wrong
Spatial Ability             26                          6 minutes                   # Correct – 1/6 # wrong
Verbal Comprehension        60                          6 minutes                   # Correct – 1/3 # wrong
Verbal Reasoning            31                          10 minutes                  # Correct – 1/4 # wrong

Other score scales (all subtests): Percentile and T-scores normed against the General British
population, the Working British population, and an industry sample.

Method of delivery (all subtests): PPM tests are delivered in paper-pencil format and in a
computer-based format using the Hogrefe TestSystem. Both modes use fixed item sets within each
subtest.

Time Limits

PPM subtests were designed to be relatively short measures for a variety of cognitive abilities.
Except for the highly speeded Processing Speed test, which allows only 3 minutes, the time limits for
the unspeeded subtests range from 6 minutes to 12 minutes. Among commonly used commercially
available cognitive test batteries, these are among the shorter time limits for subtests designed to
measure similar constructs.

Number and Types of Items

Table 50 shows the number of items for each subtest, ranging from 25 to 31 items for the more
complex power (reasoning) subtests and Spatial Ability to 32 to 60 items for the less complex
performance subtests. None of the subtests include work-like content in the items. Four are heavily
dependent on verbal content, whereas five are somewhat less dependent on verbal content. Only
two, Numerical Computation and Perceptual Reasoning, are nonverbal. While reading level is not
specified for any subtest, the level of reading content is described by Hogrefe as appropriate for all
job levels. This is even true for the Verbal Reasoning subtest, whose reading level Hogrefe describes
as deliberately kept “relatively low”, although no precise definition is provided for “relatively low”.

Type of Score

Scores on all subtests are computed as “number correct” scores, with a traditional correction for
guessing on all except Applied Power. This correction reduces the number of correct answers by a
fraction of the number of incorrect answers, with the penalty rounded down to the nearest whole
integer. The fraction is 1/4 for Mechanical Understanding, Numerical Reasoning, and Verbal
Reasoning; 1/6 for Spatial Ability; and 1/3 for all other subtests.
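Under our reading that the penalty term (not the fraction itself) is rounded down to a whole integer, the raw-score rule can be sketched as:

```python
# Sketch of the guessing-corrected raw score; "rounded down" is read here
# as applying to the penalty term, per our interpretation of the manual.
import math

def ppm_raw_score(n_correct, n_wrong, penalty_fraction):
    """Number correct minus a rounded-down fraction of number wrong."""
    return n_correct - math.floor(n_wrong * penalty_fraction)

# e.g., Spatial Ability uses a 1/6 penalty:
print(ppm_raw_score(20, 5, 1/6))  # floor(5/6) = 0, so the score stays 20
```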
Test Scale and Methods of Scaling

Two norm-based scores are provided in addition to the raw score (number correct adjusted for
guessing). Percentile scores and T-scores are each provided based on up to three norm groups: the
general British population, the working British population, and a relevant industry norm sample.

Method of Delivery

The PPM battery may be administered either on computer or in paper-pencil format. Based on
interviews with a Hogrefe industrial psychologist, Dr. Nikita Mikhailov, time limits, numbers of items
and types of items are the same for both delivery methods. Computer administration, however,
reports additional information including the numbers of correct, incorrect and missing responses, the
individual item responses as well as response latency information. Computer administration does not
require additional items or an expanded item bank beyond the items contained in the fixed paper-
pencil forms. Further, computer administration is expected to be administered in a proctored
environment, based on the responsibility of Level A users to protect the usage of PPM.

Cost

PPM materials required for administration are sold by the subtest. For each subtest administered in a
paper-pencil mode, the following costs apply.

Item                        Unit Cost

PPM Manual                  £58.00
Specimen Kit                £160.00
Single Subtest Booklet      £8.00
Set of 25 Answer Sheets     £76.00
Single Subtest Online       £6.60 - £8.00, depending on volume

Construct-Content Information

Intended Constructs

The types of subtests in PPM are typical of subtests in current commercially available cognitive ability
batteries. PPM consists of two verbal, two quantitative, two abstract reasoning, two
spatial/mechanical, and one speeded processing accuracy subtest. None of these subtests is
substantially different from comparable subtests in similar cognitive ability batteries. PPM is
somewhat distinctive, however, in that it includes a relatively high proportion of subtests (four of nine) that focus on
reasoning abilities. (Of course, all three subtests of the Watson-Glaser II battery are considered to
measure facets of reasoning.) These four subtests are the “power” subtests within the battery, in
which the relevance of learned/experienced content is minimized.

The PPM developer’s apparent frame of reference in the late 1980’s when developing PPM was the
set of well established, commonly used cognitive batteries available at the time. The developer
developed PPM to borrow from and improve on earlier commercially available cognitive batteries
including the Differential Aptitude Battery (DAT), Employee Aptitude Survey (EAS) and Wechsler
Intelligence Scales (WIS). The developer described the primary objectives for the development of
the PPM as:

 Provide a multidimensional battery of subtests measuring different types of aptitudes and
abilities from which groups of subtests could be tailored to specific jobs;

 Short time limits and ease of administration;
 Simplicity of scoring and interpretation;
 Applicable for selection and career guidance uses.

Barrett’s approach to the development of the subtests does not appear to be strongly influenced by
any particular theoretical foundation. (This, in spite of the apparent similarity of the power
(reasoning) subtests to CHC fluid intelligence and the performance tests to crystallized intelligence.)
Rather, his approach appears to have depended on the assumption that similarity between specific
job requirements, in terms of cognitive abilities, and specific test constructs was closely related to test
validity. An important consideration in the development of PPM was that subtests be different from
one another in ways that would presumably be differentially relevant to different jobs. The
combination of subtests associated with target job families shown above does not appear to have
been empirically developed. Rather, it appears to have been developed from assumptions /
observations about cognitive ability requirements of different job families. For example, performance
in Sales and Marketing jobs is presumed to be better predicted by a combination of Verbal
Reasoning, Verbal Comprehension and Numerical Reasoning than any other combination. In
contrast, performance in IT jobs (Technological/Systems jobs) is presumed to be better predicted by
reasoning subtests, Applied Power, Numerical Reasoning and Perceptual Reasoning. No
information could be located about PPM that provides a theoretical or empirical foundation for these
presumed differences between job families.

It is noteworthy to compare this perspective about job-specific validity to Schmidt’s (2012) recent
description of a content-based validation rationale for tests of specific cognitive abilities. That
content-based rationale requires that the job tasks and specific cognitive test tasks have sufficient
observable similarity to conclude that the cognitive ability measured by the test will predict
performance of the similar job tasks. But this content validity rationale is not a rationale for job-
specific validity. In other words, this rationale would not lead one to predict higher empirical
predictive validities for cognitive tests with stronger content evidence than for cognitive tests with
weaker content evidence. That is, stronger content evidence is a rationale for generalizing empirical
validity results for cognitive tests to the target job. But weaker content evidence does not imply
weaker validity, only that some other rationale, i.e., validity generalization, must be relied upon to
generalize empirical results.

Nevertheless, the PPM approach of tailoring cognitive composites to job families is very common
among other well-regarded commercially available cognitive ability batteries. Because tailoring is not
expected to harm validity, other reasons for preferring tailored composites may be appropriate.
Applicant acceptance is likely enhanced when applicants “see” the relevance of a test to the sought job.
To the extent the subtests depend significantly on already learned job knowledge, they may be better
predictors of early performance than work neutral subtests. (Of course, if already learned job
knowledge is an important consideration, job knowledge tests may be an effective supplement to or
an alternative to cognitive ability tests.)

Item Content in the Subtests

Note, in the descriptions below, the sample items were developed by the first author of this Report
because no practice or sample items could be located in publisher sources. These sample items
were intended to be similar to actual items in structure, format and content without disclosing any
actual item content.

Applied Power was designed to measure non-verbal, abstract reasoning, focusing on logical and
analytical reasoning. The items were designed to be appropriate for all job levels. Although no
information is provided about the intended reading levels of instructions, verbal ability is likely to be a
small factor in Applied Power because no item includes written verbal content. Every item consists of
a sequence of letter-number pairs with the last two pairs missing. Two answers must be chosen to
identify the two missing letter-number pairs. The sample item shown below is very typical of all 25
items except that in many items the letter is sometimes in upper case and sometimes in lower case to
provide more complex patterns within the sequence.

Each item is a patterned sequence of groups of letters and numbers. Identify the two groups that
would follow the sequence shown.

X1 - Y2 - X1 - Y2 - X1 - Y2 - X1 -

1. A) X1 B) X2 C) Y1 D) Y2

2. A) X1 B) X2 C) Y1 D) Y2

Mechanical Understanding was designed to measure understanding of basic mechanical principles


of dynamics, with a modest dependence on past related experience. Although no information is
provided about the intended reading level of items, verbal ability is likely to be a moderately small
factor in Mechanical Understanding scores. The stem of every item includes a picture depicting
some object(s) in a mechanical context and a written question about that picture. Presumably, past
experience with the depicted objects will influence scores on this subtest. The sample item shown
below is typical of all items in that it displays a picture of an object (a beam) and a question about that
object. However, the pictured objects vary considerably.

[Figure: a drawing of a beam with three points labeled A, B, and C]

At which point is the beam most likely to balance?

Numerical Computation was designed to measure the ability to quickly solve arithmetic problems.
Although no information is provided about the intended reading levels of instructions, verbal ability is
likely to be a small factor in Numerical Computation because no item includes written verbal content.
Every item stem consists of an arithmetic problem consisting of 2-4 integers or decimal numbers and
one or more of four arithmetic operation symbols, +, -, x, and ÷. The answer options are integers or
decimal numbers. The sample item shown below is very typical of all 40 items, with some less
complex and some more complex.

22 X 5 – 2 A) 66 B) 108 C) 25 D) 37

Numerical Reasoning was designed to measure the ability to reason about the relationships
between numbers. It was intended to be an indicator of critical and logical thinking. The items were
designed to be appropriate for all job levels. Although no information is provided about the reading
levels of the moderately complex instructions, verbal ability is likely to be a small factor in Numerical
Reasoning because no item includes written verbal content. Every item consists of three pairs of two
number values and a fourth answer alternative, “no odd pair”. Each pair of two numbers is presented
as an “is to” logical relationship. For example, the pair, “100 : 50”, is intended to represent the logical
relationship, “100 is to 50”. The task is to identify which one of the pairs, if any, represents a different
logical relationship than the other two. If all three pairs represent the same logical relationship, then
the correct answer is “no odd pair”. The sample item shown below is very typical of the 25 items,
except that some items contain fractions in the pairs of numbers.

Find the pair that has a relationship different from the other two pairs. In some cases, no pair has a
different relationship.

(A) 5 :10 (B) 1 : 2 (C) 2 : 10 (D) No different pair

Perceptual Reasoning is designed to measure the ability to reason about and find relationships
between pictures of abstract shapes. This subtest was developed to provide an indicator of general
intelligence and problem solving ability. The items were designed to be appropriate for all job levels.
Although no information is provided about the reading levels of the moderately complex instructions,
verbal ability is likely to be a small factor in Perceptual Reasoning because items contain very little,
and simple, written verbal content. The stem of each item consists of 2-4 abstract shapes which
have some form of relationship to one another. The relationships include sequence, pattern, shape,
and logic, among other possibilities. Embedded in the stem of most items is short text that asks a
question about the target relationship. The alternatives in most items consist of four similar abstract
shapes, among which one represents the correct answer to the question. For those few items in
which pattern similarity is the target relationship, the question is "which one is the odd one out". The
sample item shown below represents this style of item and is similar to some of the abstract shapes
used. However, a wide variety of abstract shapes is used among the 26 items.

[Abstract shape] is to [abstract shape] as [abstract shape] is to ?

(A) (B) (C) (D)  [answer shapes not reproduced]

Processing Speed is designed to measure the ability to quickly perform a low-moderate complexity
mental task on a set of written content. More specifically, it is designed to measure the capacity to
quickly analyze and alphabetize three similar words. The items were designed to be appropriate for
all job levels. However, no information is provided about the reading levels of (a) the relatively simple
instructions, or (b) the stimulus words themselves, which range from common, simple four-letter
words to much less common, more complex words. While familiarity with the meaning of the words is
not required to alphabetize the words, differences in familiarity with and understanding of the meaning
of the words potentially affect the speed with which this task may be performed. Each item consists of
three words with the instruction to identify the word that comes first in alphabetical order. Two
sample items are shown below. The words in the sample items below appear to the Report author to
be approximately average in length, complexity and familiarity.

For each item, you are shown three words. Identify the word that comes first in alphabetical order.

1. (A) James (B) Jerard (C) Janet

2. (A) Wind (B) Window (C) While

Spatial Ability was designed to measure the ability to visualize and mentally manipulate objects in
three dimensions. The items were designed to be appropriate for all job levels. Although no

information is provided about the reading levels of the complex instructions, verbal ability is likely to
be a small factor in Spatial Ability because items contain no written verbal content. However, the
instructions are relatively complex and may require more reading comprehension than other PPM
subtests. The stem of each item consists of two squares with lines drawn inside each. For each item
the task is the same: mentally rotate Square X 90° to the left, turn Square Y upside down so its top
side is on the bottom, and superimpose rotated X onto upside-down Y. The answer requires the test
taker to identify the number of spaces displayed in the superimposed square.

Rotate Square X 90° to the left. Turn Square Y upside down so its top side is on the bottom. Then
superimpose Square X onto Square Y. How many spaces are formed by the resulting figure?

[Figures of Square X and Square Y not reproduced]

(A) 1 space (B) 2 spaces (C) 3 spaces (D) 4 spaces

It should be noted that Spatial Ability is regarded as a Performance measure, which is considered an
experience-based ability to perform, and not a Power measure, which is considered a reasoning
ability. However, the abstract item types that comprise Spatial Ability do not appear to be susceptible
to experience and instead require mental rotation of abstract, although familiar (squares), figural
representations. This type of task is more frequently recognized as an assessment of spatial
reasoning.

Verbal Comprehension was designed to measure the ability to understand English vocabulary.
While the level of word difficulty appears to vary considerably, the Report authors have not been able
to locate any documentation of the reading / vocabulary levels of the words used in the items. Each
item consists of two words. The test taker’s task is to determine whether the two words have similar
meaning, opposite meaning, or unrelated meaning. The sample items below are very similar to the
range of levels observed in the subtest.

Each item consists of two words. Choose A if they have similar meaning, B if they have opposite
meaning, and C if there is no connection between their meanings.

1. Happy Glad

A. Similar B. Opposite C. No Connection

2. Ecstatic Elastic

A. Similar B. Opposite C. No Connection

Verbal Reasoning was designed to measure comprehension of and reasoning about verbally
expressed concepts. As a Power subtest, it is intended to be an indicator of the ability to think
critically and logically about verbal content. The subtest is intended to be appropriate for all job
levels, although no information is provided about actual reading levels. Nevertheless, the Technical
Manual indicates that the vocabulary level was deliberately kept "relatively low" to avoid too much
influence of language achievement. The two sample items shown below are typical of those used in
this subtest in that both are based on the same short reading passage. The 31 items in the subtest
consist of 8 reading passages, each with 3-5 items.
James’ 4-yr old daughter Sally has a close friend and classmate, Joyce. Joyce’s father Bill has an
uncle, Samuel, who was a good friend of James’ father.

1. Who is most likely to be the oldest person in the passage?

(A) James (B) Bill (C) Samuel

2. Who is Sally least likely to know?

(A) Bill (B) James (C) Samuel

Overall, the PPM developers clearly sought to develop work-neutral items for all subtests, with a mix
of verbal and non-verbal content. While no information was presented about reading levels of the
subtests, the developers intended to develop item content that would be appropriate for all job levels.
It is notable that item tasks for some subtests were novel in the Report author’s judgment (see, in
particular, Perceptual Reasoning and Spatial Ability) and may have resulted in relatively complex
instructions.

Construct Validity Evidence

The publisher reported correlations between selected PPM subtests and subtests included in three
other commercially available cognitive ability batteries – Employee Aptitude Survey (EAS), SHL tests,
and NIIP tests. These construct validity correlations are shown in Table 51.
Although the PPM Technical Manual does not explain why certain correlations were reported and
others were not, the Report authors identified the EAS, SHL and NIIP subtests that appear to have
similar item content and underlying constructs to corresponding PPM subtests. These correlations
are regarded as estimates of convergent validities and are shown in the circles. Their average value
is .60, which is not uncommon for subtests measuring similar constructs where the subtest reliabilities
are in the upper .70s-.80s. The highest of these convergent validities are associated with pairs of
subtests with presumably very high construct and content similarity.
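The link between observed convergent correlations and subtest reliability can be made concrete with Spearman's correction for attenuation. The sketch below is our own illustration, not part of the PPM manual; it shows that an observed r of .60 between two subtests whose reliabilities are near .80 corresponds to a true-score correlation of about .75.

```python
import math

def disattenuate(r_obs: float, rel_x: float, rel_y: float) -> float:
    """Spearman's correction for attenuation: estimated correlation
    between true scores, given the observed correlation and the
    reliabilities of the two measures."""
    return r_obs / math.sqrt(rel_x * rel_y)

# Observed convergent r = .60, both reliabilities about .80:
print(round(disattenuate(0.60, 0.80, 0.80), 2))  # 0.75
```

This is why a convergent correlation in the .60s is unremarkable for subtests with reliabilities in the upper .70s-.80s: measurement error alone caps the observable correlation well below 1.0.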

Table 51. Correlations between PPM subtests and other subtests regarded as measuring similar
constructs. PPM subtest columns in the original table: Applied Power, Processing Speed, Mechanical
Understanding, Numerical Computation, Numerical Reasoning, Perceptual Reasoning, Spatial Ability,
Verbal Comprehension, Verbal Reasoning.

EAS study:
 Numerical Reasoning (N = 144-153): .44, .60
 Visual Speed (N = 152): .74
 Space Visualization (N = 140-158): .61, .42, .53
 Numerical (N = 20): .68
 Visual Pursuit (N = 140): .34
 Symbolic Reasoning (N = 140): .49
 Word Knowledge (N = 131): .76
 Verbal Reasoning (N = 201): .57

SHL study:
 Number Series (N = 97-107): .29, .46
 Numerical Reasoning (N = 41-182): .60, .39, .26
 Verbal Reasoning (N = 100-182): .44, .23, .29

NIIP study:
 Bennett Mechanical (N = 47, 157*): .49, .54
 Arithmetic (N = 47): .55, .35
 Spatial Test (N = 47): .75
 Verbal (N = 47): .61

Subtest pairs which do not appear to the Report authors to have high construct or content similarity
are not circled. These probably should not be regarded as divergent validities because many were
designed to measure similar constructs but with somewhat different types of item content.
Nevertheless, it is reasonable to expect these uncircled correlations to be lower on average than the
circled convergent correlations; they have an average value of .36. Overall, these construct
correlations appear to provide moderate to strong construct evidence for Numerical Computation,
Numerical Reasoning, Processing Speed, Spatial Ability, and Verbal Comprehension. Applied
Power, especially, produced lower correlations than would be expected with other reasoning
measures, even though none of those reasoning measures used the same item content as Applied
Power.

No factor analyses of PPM subtests have been located, and the Hogrefe psychologist is not aware of
any such studies of the PPM factor structure.

Item / Exam Development

The technical manual and other sources provided very little technical/psychometric information about
the processes used to develop the items and subtests. No information was provided about the item
writing processes, including item specifications provided to item writers, review processes that might
have been used to screen for culturally salient, irrelevant content, pilot studies, item analyses,
psychometric standards for item acceptance/rejection, or the development of alternate, equivalent
forms. It appears clear that no IRT-based psychometric methods were applied. The Hogrefe
psychologist confirmed that no documentation is available about Barratt’s original item development
process and that no new items have been developed since then.

During development each subtest was administered to various samples of incumbent employees and
applicants. Although we are not certain, it appears these samples comprise the British working
population norm groups used to determine one of the T-score scales. Table 52 shows the
characteristics of these development norm group samples. The Age and Sex samples shown in
Table 52 provided the group statistics reported in Table 54 below.

Table 52. Demographics and statistics for the item development norm groups.

 Applied Power: Mean 11.82, SD 4.39, N 384. Clerical workers in large financial institutions.
Age: 17-37 yrs, Mean 22.26 yrs (N=374). Sex: Female 71, Male 313.
 Processing Speed: Mean 19.91, SD 6.63, N 555. Clerical workers in large financial institutions.
Age: no information. Sex: Female 425, Male 99.
 Mechanical Understanding: Mean 15.09, SD 5.78, N 718. Engineers, engineering tradespeople,
factory workers, manufacturing applicants, pilot applicants. Age: 17-60 yrs, Mean 26.83 yrs (N=616).
Sex: Female 49, Male 659.
 Numerical Computation: Mean 13.80, SD 6.27, N 793. Engineers, engineering tradespeople,
factory workers, manufacturing applicants, pilot applicants. Age: 17-60 yrs, Mean 27.32 yrs (N=669).
Sex: Female 57, Male 724.
 Numerical Reasoning: Mean 10.89, SD 4.98, N 368. Engineers, engineering tradespeople (fitters,
electricians), managers, pilot and manufacturing applicants, factory workers. Age: 17-54 yrs, Mean
25.38 yrs (N=255). Sex: Female 90, Male 262.
 Perceptual Reasoning: Mean 11.32, SD 3.65, N 283. Engineering tradespeople, factory workers,
manufacturing applicants. Age: 17-54 yrs, Mean 26.71 yrs (N=159). Sex: Female 20, Male 247.
 Spatial Ability: Mean 9.27, SD 4.15, N 590. Engineers, engineering tradespeople (fitters,
electricians), pilot applicants, creatives. Age: 16-59 yrs, Mean 25.66 yrs (N=495). Sex: Female 32,
Male 558.
 Verbal Comprehension: Mean 21.20, SD 10.79, N 162. Managers, engineering tradespeople.
Age: 18-60 yrs, Mean 26.25 yrs (N=134). Sex: Female 13, Male 137.
 Verbal Reasoning: Mean 13.97, SD 5.48, N 464. Engineering tradespeople, factory workers,
manufacturing applicants, managers. Age: 20-58 yrs, Mean 29.33 yrs (N=332). Sex: Female 77,
Male 371.

Table 53 shows results for subtest reliability and correlations among subtests. It is not clear what
sample(s) were used for reliability analyses; it seems most likely that the norm group samples
described above in Table 52 produced these reliability estimates. The correlations among all
subtests were generated in a separate sample of 337 professional managers. The PPM Technical
Manual describes all reliability estimates as alpha estimates of internal consistency reliability. It
should be noted that alpha estimates of reliability are positively biased for speeded tests, such as
Processing Speed. As a result, the .89 estimated reliability for Processing Speed is very likely an
overestimate by some amount. Two patterns of results in the subtest correlations stand out. First,
except for Spatial Ability and Processing Speed, the correlations of other subtests with Mechanical
Understanding are lower than would be expected; tests of mechanical aptitude/reasoning are often
closely associated with general mental ability and tend to have somewhat higher correlations with
other subtests. Second, in contrast, the subtest correlations with Processing Speed appear to be
somewhat higher than expected; in batteries of cognitive subtests, the highly speeded subtests often
correlate least with the other subtests. It is possible that the PPM Processing Speed subtest is more
heavily loaded on verbal reasoning because of the requirement in each item that test takers
determine the alphabetical order of often complex words.
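For reference, the alpha coefficient is computed from item-level variances and the total-score variance. The sketch below is a generic illustration of that calculation, not PPM's procedure; the docstring notes why speeded administration inflates the estimate.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (respondents x items) matrix of item scores.

    Caveat: for a highly speeded test, unreached items scored 0 create
    spurious inter-item consistency, which inflates this estimate.
    """
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Toy 0/1 response matrix: 5 respondents x 3 items
responses = np.array([[1, 1, 1],
                      [1, 1, 0],
                      [0, 0, 0],
                      [1, 0, 0],
                      [1, 1, 1]])
print(round(cronbach_alpha(responses), 2))  # 0.79
```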

Table 53. Item development statistics – Reliability and correlations among subtests. Means, SDs and
correlations are from a sample of 337 professional managers.

PPM Subtest | Reliability (Alpha) | Mean | SD | AP | PS | MU | NC | NR | PR | SA | VC | VR
Applied Power (AP) | .88 | 13.07 | 4.69 | --
Processing Speed (PS) | .89* | 28.84 | 7.43 | .44 | --
Mechanical Understanding (MU) | .79 | 18.39 | 5.76 | .36 | .20 | --
Numerical Computation (NC) | .86 | 21.16 | 6.69 | .36 | .46 | .14 | --
Numerical Reasoning (NR) | .83 | 12.88 | 4.37 | .47 | .48 | .27 | .59 | --
Perceptual Reasoning (PR) | .74 | 11.46 | 3.95 | .46 | .44 | .32 | .40 | .51 | --
Spatial Ability (SA) | .79 | 12.57 | 3.23 | .49 | .34 | .45 | .30 | .51 | .43 | --
Verbal Comprehension (VC) | .84 | 40.57 | 9.09 | .25 | .28 | .33 | .24 | .25 | .15 | .25 | --
Verbal Reasoning (VR) | .83 | 17.12 | 4.44 | .46 | .53 | .22 | .37 | .43 | .45 | .36 | .34 | --

* Note: Alpha estimates of reliability are known to be overestimates with highly speeded tests.

Table 54 shows sex and age group differences on each PPM subtest. Because neither the Technical
Manual nor other sources provided any information about item development methods to minimize
biasing factors, it is not clear whether the group differences reported in Table 54 reflect bias-free or
biased estimates of valid group differences on the subtest constructs. Setting aside this uncertainty,
it is notable that female test takers scored higher than male test takers on six of the nine subtests,
particularly on Spatial Ability. A very common result is that males score higher than females on
spatial/mechanical aptitude tests. Perhaps the highly abstract feature of the PPM Spatial Ability item
content produces scores that are much less affected by previous spatial/mechanical experience,
which is regarded as an important factor in the more typical result favoring males. Another possible
explanation is that the norm groups for all subtests except Applied Power and Processing Speed
consisted in some part of engineering, manufacturing, and factory workers. The women in these
worker samples may not be representative of the British working population with regard to experience
and ability in the spatial/mechanical domains.

Table 54. Group means, standard deviations and sample sizes (Mean / SD / N). Age groups are
"Younger than 22-24 yrs" and "22-24 yrs and older".

 Applied Power: Male 11.60 / 4.29 / 313; Female 12.79 / 4.72 / 71; Younger 12.26 / 4.51 / 214;
Older 11.33 / 4.30 / 160.
 Processing Speed: Male 17.38 / 7.13 / 99; Female 20.54 / 6.41 / 425; Age correlated with score,
r = -.16 (N=225).
 Mechanical Understanding: Male 15.32 / 5.74 / 659; Female 12.14 / 5.82 / 49; Younger 16.43 /
5.45 / 327; Older 14.49 / 5.81 / 289.
 Numerical Computation: Male 13.87 / 6.36 / 724; Female 13.01 / 5.52 / 57; Younger 14.64 / 5.81 /
328; Older 13.75 / 6.48 / 341.
 Numerical Reasoning: Male 10.76 / 5.16 / 262; Female 11.83 / 4.33 / 90; Younger 13.79 / 4.48 /
122; Older 10.75 / 4.83 / 133.
 Perceptual Reasoning: Male 11.28 / 3.59 / 247; Female 12.80 / 3.64 / 20; Younger 12.78 / 3.41 /
82; Older 11.29 / 3.71 / 77.
 Spatial Ability: Male 9.21 / 4.15 / 558; Female 10.19 / 4.08 / 32; Younger 10.41 / 3.72 / 270; Older
8.15 / 4.40 / 225.
 Verbal Comprehension: Male 20.36 / 0.71 / 137; Female 22.15 / 7.62 / 13; Younger 17.39 / 6.26 /
74; Older 24.08 / 12.40 / 60.
 Verbal Reasoning: Male 13.15 / 5.20 / 371; Female 18.45 / 4.95 / 77; Younger 17.88 / 4.71 / 170;
Older 11.57 / 4.75 / 162.

No evidence about alternate forms was presented in the PPM Technical Manual or could be located
elsewhere: no information about how the forms were developed or about the psychometric properties
that would demonstrate their equivalence. The Hogrefe psychologist confirmed that IRT-based
psychometrics have not been used to evaluate PPM items and exams.

The Hogrefe psychologist confirmed that the PPM approach relies on a fixed, non-overlapping item
set for each alternate form regardless of method of administration. PPM is not supported by an item
bank other than the items that are included in the fixed forms of the current subtests.

Criterion Validity Evidence

No criterion validity evidence has been located for PPM subtests in publisher-provided materials, in
any other publicly available sources, or in the one professional review found in Buros. Further, the
Hogrefe psychologist is not aware of any criterion validity studies or reports for PPM.

Approach to Item / Test Security

This information was discussed at some length with the Hogrefe psychologist responsible for
supporting PPM. Based on information he shared, PPM is a low-volume battery used primarily in the
UK. While it is administered in both paper-pencil and online modes, it consists only of the two
original forms (we have found documentation for only one form). Hogrefe requires proctored
administration in both paper-pencil and online modes of delivery, and relies entirely on proctoring and
administrator training and certification to ensure the security of individual items and whole subtests.
Hogrefe has no process for monitoring item psychometric characteristics to evaluate possible drift due
to overexposure, fraudulent disclosure or other patterns of cheating, and it employs no data forensic
methods for reviewing item responses to detect indicators of cheating or attempts to fraudulently
obtain item content. In short, Hogrefe falls considerably short of professional practices for the
protection of PPM item and test content from cheating.

Translations / Adaptations

PPM is available only in UK English and Hogrefe has had no experience translating PPM items or
supporting material into other languages.

User Support Resources (E.g., Guides, Manuals, Reports, Services)


Hogrefe provides only the bare minimum of support resources to enable users to administer and
score PPM subtests. Other than the test materials themselves, answer sheets, test booklets and
results Reports, the only available support resources are

 PPM Technical Manual, including administration instructions


 Sample Reports

Online administration may be arranged at tara.vitapowered.com.

No other support materials could be identified by the authors or by the Hogrefe industrial psychologist
who supports the PPM battery.

Evaluations of PPM

Only one professional review of PPM has been located. Geisinger (2005) reviewed PPM in Buros’
Sixteenth Mental Measurements Yearbook. His review appears to be based on an earlier technical
manual than was available for this project. He criticized PPM for the lack of predictive validity
evidence, for reporting (positively biased) alpha coefficients for speeded subtests, and for a lack of
data about group differences. In our judgment virtually all of his evaluations apply equally to the
technical manual reviewed for this project, except for the criticism that no group difference data were
presented; the manual reviewed for this project provides male-female differences that apparently
were not reported in the manual Geisinger reviewed. Overall, Geisinger was critical of the lack of
technical detail relating to item and subtest development and of the lack of criterion validity evidence.

We have located no evaluative empirical analyses of PPM’s factor structure or other psychometric
properties in published sources or from Hogrefe. Hogrefe’s psychologist indicated that he was
unaware of any such empirical analyses.

Overall, the available documentation about PPM is not adequate to evaluate its criterion validity or the
processes of item development to ensure item quality and lack of measurement bias with respect to
culture or other group differences. (It should be noted that the nonverbal / abstract content in the
reasoning subtests in particular eliminates most sources of group-related measurement bias.) At the
same time, typical reliability levels and adequate convergent correlations with other established
subtests in other batteries provide some support for the psychometric and measurement adequacy of
PPM. Our evaluation of the PPM battery is that the subtests themselves appear to be well-developed
measures with adequate reliability and convergent validities with other tests.
Nevertheless, the publisher’s support of PPM in terms of validity studies and documentation as well
as security protection of the fixed forms falls well below professional standards for validity and
protection of test content. Two interviews with Hogrefe’s industrial psychologist who has recently
taken on support of PPM, Dr. Nikita Mikhailov, have confirmed this lack of support documentation.
The most likely explanation, in Dr. Mikhailov’s opinion, is that Hogrefe’s recent acquisition of PPM’s
previous publisher failed to transition any available support material to Hogrefe.

Descriptive Results for Verify

SHL’s Verify Battery

Authors’ Note: The information provided in this section is taken from the 2007 Technical Manual for
Verify and other available sources. In 2007, Verify consisted of three subtests, Verbal Reasoning,
Numerical Reasoning and Inductive Reasoning, and their Verification counterparts. After 2007, SHL
added four subtests to the Verify battery, Mechanical Comprehension, Calculation, Checking, and
Reading Comprehension, and their Verification counterparts. At the time this Final Report was being
finalized, it was learned that two additional subtests have just been added, Spatial Ability and
Deductive Reasoning. However, in the interest of protecting their intellectual property, SHL declined
to approve our access to any further publisher-provided information about Verify and its additional
subtests beyond the 2007 Technical Manual and publically available documents. No SHL
documents or other published documents could be located that provide samples of item types for the
two most recent subtests or validity information beyond what was reported in the 2007 Technical
Manual.

Overall Description and Uses

Introduction

SHL developed the Verify battery of subtests in stages, beginning with three subtests, Numerical
Reasoning, Verbal Reasoning and Inductive Reasoning, implemented by 2007. Four additional tests
were incorporated into Verify before 2013: Mechanical Comprehension, Checking, Calculation and
Reading Comprehension. The Verify battery is designed to be administered online in unproctored
settings. The validity and security of these unproctored tests are supported by an overall strategy
with several key components. Each test taker is given a randomized form of each subtest, where the
randomized form is constructed from a large bank of items, each of which has IRT estimates and
other characteristics which govern the construction process to ensure all randomized forms are
psychometrically equivalent. In addition, SHL recommends that any applicant who reaches a final
stage of consideration complete a somewhat shorter verification battery of subtests, also
randomized forms, to verify that the first test result was not fraudulently achieved.
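SHL does not publish its form-construction algorithm. As a rough illustration of the general idea of drawing psychometrically comparable randomized forms from an IRT-calibrated bank, the sketch below restricts selection to a target difficulty band; the bank schema and field names are our hypothetical invention.

```python
import random

def draw_randomized_form(bank, n_items, b_range, seed=None):
    """Draw one random form whose items all lie in a target difficulty band.

    bank: list of dicts like {"id": 17, "a": 1.1, "b": 0.3}  (hypothetical schema)
    b_range: (low, high) bounds on the IRT difficulty parameter b
    """
    rng = random.Random(seed)
    eligible = [item for item in bank if b_range[0] <= item["b"] <= b_range[1]]
    if len(eligible) < n_items:
        raise ValueError("item bank too small for the requested form")
    return rng.sample(eligible, n_items)

# Example: a 200-item bank, an 18-item form aimed at a mid-difficulty band
bank = [{"id": i, "a": 1.0, "b": -2.0 + 4.0 * i / 199} for i in range(200)]
form = draw_randomized_form(bank, 18, (-0.5, 0.5), seed=1)
print(len(form))  # 18
```

A production system would also balance item exposure rates and content categories, which this sketch omits.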

The most distinctive feature of the Verify battery is the comprehensiveness of the security measures
taken to enable confidence in scores from unproctored, online administration. SHL’s Verify clearly
represents the best practice in employment testing for the use of unproctored online testing.

Purpose

SHL developed Verify to provide a professionally acceptable unproctored online assessment of
cognitive abilities used for personnel selection purposes. Developed in stages, Verify now represents
a broad range of subtest types which are appropriate for a wide range of occupations and a wide
range of work levels. Verify was among the first widely available, commercial cognitive batteries to
include a separate, proctored verification testing capability designed to evaluate whether there is any
risk that the first, operational score was achieved fraudulently.

Verify was developed, in part, to optimize the convenience of the test taking process for both
applicants and employers by enabling it to be administered to applicants anywhere, anytime.

A unique aspect of Verify is that the IRT-based construction rules enable SHL to control the “level”
(i.e., difficulty) of each randomized set of items. This allows SHL to tailor the level of the Verify
battery to any of six different levels associated with different roles.

Target Populations

Verify is appropriate for a wide range of worker populations because of its ability to tailor the difficulty
level of randomized items to the particular job/role. As a result, there is no one target reading level
for Verify items.

The item content of Verify subtests is tailored to 9 different levels of jobs: (1) Manager / Professional,
(2) Graduate, (3) Skilled Technology, (4) Junior Manager, (5) Senior Customer Contact, (6) Skilled
Technical, (7) Junior Customer Contact, (8) Administrator, and (9) Semi-Skilled Technical. Levels 1-3
comprise the Managerial/Graduate job group; Levels 4-6 comprise the Supervisory / Skilled
Technical job group; and Levels 7-9 comprise the Operational / Semi-Skilled Technical job group.

Target Jobs / Occupations

SHL describes the range of appropriate job families as “from operational roles to senior
management”. The different abilities assessed by Verify subtests and the different levels of item
difficulty and content enable Verify to be appropriate for a wide range of jobs. These include jobs in
the following industry sectors.

 Banking, Finance, & Professional Services


 Retail, Hospitality, & Leisure
 Engineering, Science & Technology
 Public Sector/Government
 General population.

In addition to occupation groups, each Verify subtest is linked to one or more of the following levels of
work as shown in Table 55. (Listed from highest to lowest level.)

Table 55. Use of Verify subtests at different job levels. Subtest columns: Verbal Reasoning,
Numerical Reasoning, Inductive Reasoning, Mechanical Comprehension, Checking, Calculation.*

 Director, Senior Manager: X X X
 Manager, Professional, Graduate: X X X X
 Junior Manager, Supervisor: X X X X
 Sales, Customer Service, Call Center Staff: X X
 Information Technology Staff: X X X X
 Administrative and Clerical Staff: X X X X
 Technical Staff: X X X X X
 Semi-Skilled Staff: X X

*Reading Comprehension, Spatial Ability and Deductive Reasoning were not included in Verify at the
time SHL published this alignment.

It should be noted here that the association of a subtest with a specific job level means that SHL has
specified a level of item difficulty in terms of IRT theta range for items to be included in randomized
forms for the particular job level. This implies that randomized forms are constructed based on
several conditions, including the job level for which the applicant is applying. That is, the item
content for any subtest differs depending on job level.

Spread of Uses

Verify appears to be used only for personnel selection. There is no indication in any SHL support
materials that Verify is used for other purposes such as career development. Certainly, SHL’s
significant investment in supporting the unproctored mode of administration attempts to protect
against fraudulent responding, which is likely to be relevant only for high stakes applications such as
employment.

Administrative Details

Table 56 reports several administrative features of the Verify subtests.


For each subtest and its verification counterpart, the number of items and administration time limit are
reported as well as the scoring rules and mode of delivery.

Time Limit

The time limits for the Verbal, Numerical and Inductive reasoning subtests are among the longer time
limits for similar subtests among the reviewed batteries. Time limits for the remaining tests are
typical of similar subtests in other batteries. The time limits vary somewhat for the Verbal and
Numerical subtests because higher level versions appropriate for higher level jobs/roles require
slightly more time than lower level versions. SHL manipulates “level” of test content by shifting the
difficulty levels of items included in the subtests to align with the level of role complexity.

Numbers and Types of Items

Overall, the numbers of items are typical of similar subtests in other batteries. Item content in the
Numerical and Verbal Reasoning subtests is somewhat more work-like than the large majority of other
subtests among the reviewed batteries. In contrast, the Inductive Reasoning item content is abstract
and not at all work-like.

Table 56. Administrative features of Verify subtests (number of items, time limit).

 Verbal Reasoning: 30 items, 17-19 minutes*
 Verbal Reasoning Verification: 18 items, 11 minutes
 Numerical Reasoning: 18 items, 17-25 minutes*
 Numerical Reasoning Verification: 10 items, 14-15 minutes*
 Inductive Reasoning: 24 items, 25 minutes
 Inductive Reasoning Verification: 7 items, 7 minutes
 Mechanical Comprehension: 15 items, 10 minutes
 Mechanical Comprehension Verification: 15 items, 10 minutes
 Checking: 25 items, 4-5 minutes
 Checking Verification: 25 items, 4-5 minutes
 Calculation: 20 items, 10 minutes
 Calculation Verification: 10 items, 5 minutes
 Reading Comprehension: 18 items, 10 minutes
 Reading Comprehension Verification: number of items not available, 10 minutes
 Spatial Ability: 22 items, 15 minutes
 Deductive Reasoning: 20 items, 18 minutes

Scoring rule (all subtests): For unproctored tests, IRT-based ability estimates are converted to
percentiles, standardized sten scores, and T scores based on various norm groups. For Verification
tests, no scores are reported; only Verified or Not Verified is reported, based on a comparison of the
two IRT ability estimates.

Delivery mode (all subtests): "Randomized" individual forms; online; unproctored. Verification tests
must be administered in a proctored setting, online with "randomized" individual forms.

Scoring

Verify produces test scores by estimating the ability level (theta) for each test taker based on their
item responses. The 2007 Verify technical manual provides the following description of this theta-
based scoring process.

Each test taker’s theta estimate is obtained through an iterative process which essentially operates,
for 2-parameter models, as follows:
 A set of items for which a and b values are known are administered to the candidate.
 The candidate’s right and wrong responses to the items are obtained.
 An initial estimate of the candidate’s θ is chosen (there are various procedures for making this
choice).
 Based on the initial θ used and knowledge of each item’s properties, the expected probability of
getting the item correct is calculated.
 The difference between the candidate answering an item correctly and the probability expected of
the candidate answering the item correctly, given the initial theta value, is calculated.
 The sum of these differences across items is standardised, and this standardised difference is
added to the initial θ estimate (negative differences reducing the estimated theta and positive
differences increasing it).
 If the differences are non-trivial, then the new θ estimate obtained from the previous step is used to
start the above cycle again.
 This process is repeated until the difference between the value of θ at the start of a cycle and the
value obtained at the end of a cycle is negligible. (Pg. 13)

See Baker (2001) for a more detailed account of theta scoring with worked examples. This
approach to scoring is suited to the randomised testing approach where candidates receive
different combinations of items. As all the items are calibrated to the same metric, then this process
also allows scores on different combinations of items to be directly compared and treated as from
the same underlying distribution of ability scores.
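The quoted procedure is, in essence, a maximum-likelihood estimate of theta under the 2-parameter logistic model. The sketch below is our own illustration of that kind of iterative (Newton-Raphson) update, not SHL's code; the item parameters and response pattern are invented for the example.

```python
import math

def p_correct(theta, a, b):
    """2-parameter logistic model: probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def estimate_theta(items, responses, theta=0.0, tol=1e-4, max_iter=100):
    """Iterative MLE of theta for a 2PL model.

    items: list of (a, b) parameter pairs; responses: 1 = correct, 0 = wrong.
    """
    for _ in range(max_iter):
        # Weighted sum of (observed - expected) differences across items
        num = sum(a * (u - p_correct(theta, a, b))
                  for (a, b), u in zip(items, responses))
        # Test information at theta scales the step size
        den = sum(a * a * p * (1 - p)
                  for (a, b) in items
                  for p in [p_correct(theta, a, b)])
        step = num / den
        theta += step          # positive differences raise theta, negative lower it
        if abs(step) < tol:    # stop when the change per cycle is negligible
            break
    return theta

# Invented example: five calibrated items; the candidate answers the three
# easier items correctly and misses the two harder ones
items = [(1.2, -1.0), (1.0, -0.5), (0.9, 0.0), (1.1, 0.5), (1.3, 1.0)]
responses = [1, 1, 1, 0, 0]
theta_hat = estimate_theta(items, responses)
print(round(theta_hat, 2))
```

Because all items are calibrated to a common metric, the same routine yields comparable theta estimates for candidates who received different item combinations.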

For reporting purposes, these theta estimates are converted to percentile scores, sten scores and T-
scores based on selected norm groups. SHL’s research base supporting the Verify subtests has
established over 70 comparison (norm) groups represented by combinations of industry sector, job
level, and subtest.
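These reporting conversions follow standard definitions: a percentile from the normal CDF, a sten score with mean 5.5 and SD 2 (bounded 1-10), and a T score with mean 50 and SD 10. The sketch below applies them to a theta standardized against a norm group; the norm-group mean and SD here are invented, whereas SHL's norm tables are empirical.

```python
import math

def z_score(theta, norm_mean, norm_sd):
    """Standardize a theta estimate against a chosen norm group."""
    return (theta - norm_mean) / norm_sd

def to_percentile(z):
    """Percentile from the standard normal CDF."""
    return 100.0 * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def to_sten(z):
    """Sten score: mean 5.5, SD 2, bounded to the 1-10 scale."""
    return min(10.0, max(1.0, 5.5 + 2.0 * z))

def to_t(z):
    """T score: mean 50, SD 10."""
    return 50.0 + 10.0 * z

# Invented norm group with mean theta 0.0 and SD 1.0
z = z_score(0.5, 0.0, 1.0)
print(round(to_percentile(z), 1), to_sten(z), to_t(z))
```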

Delivery Mode: Operational and Verification Subtests

The operational and verification subtests are both delivered online using randomized forms.
However, operational tests are delivered in unproctored settings while the verification subtests must
be delivered in proctored settings with identification confirmation in order to minimize the likelihood of
fraudulent verification scores. Also, in most cases, but not all, the verification version of a subtest
has fewer items and a shorter time limit than the operational version of the subtest. (Note, SHL’s
approach to the use of the operational subtests and their verification counterparts is that the
operational test results are considered to be the applicants’ formal test result that informs the hiring
decision. Verification test results are used only to identify the likelihood that the operational results
were obtained fraudulently. This policy is necessary because only some applicants take the
verification tests. If the policy were to combine the verification results with the operational results or
to replace the operational results with the verification results, fraudulent test takers would be
rewarded, in effect, for their fraudulent performance.)

Cost

Note, because SHL declined to approve the authors’ registration for access to SHL information, we
were unable to gather information about Verify costs.

Construct-Content Information

Intended Constructs

Table 57 provides SHL’s descriptions of the constructs intended to be measured by each Verify
subtest. Notably, with the exception of Inductive Reasoning which it describes as a measure of fluid
intelligence, SHL has not defined any of these constructs in the language of a particular theory of
ability such as the CHC model of intelligence or the French-Harmon Factor Model of Cognitive Ability.
Rather, as is typical of all reviewed batteries except Watson-Glaser, the ability constructs,

presumably, are taken from the accumulated validity evidence in personnel selection research
showing the types of cognitive ability tests that consistently yield substantial positive validity.

Table 57. Information about Verify subtest constructs and item content.

Verbal Reasoning
Construct Definition: This is the ability to deductively reason about the implications / inferences of
verbally presented problems. This is deductive reasoning because it applies to problems that are
“bounded and where methods or rules to reach a solution have been previously established.”
Level Distinctions: Verbal subtests constructed for Management and Graduate level jobs should have
an effective theta range from -2.0 to +0.5. Verbal subtests constructed for Supervisory and
Operational levels of jobs should have an effective theta range from -3.0 to -0.8.
Item Specifications: Deductive reasoning problems presented in verbal content with a work-like
context.

Numerical Reasoning
Construct Definition: This is the ability to deductively reason about the implications / inferences of
numerically presented problems. This is deductive reasoning because it applies to problems that are
“bounded and where methods or rules to reach a solution have been previously established.”
Level Distinctions: Numeric subtests constructed for Management and Graduate level jobs should
have an effective theta range from -1.5 to +1.0. Numeric subtests constructed for Supervisory and
Operational levels of jobs should have an effective theta range from -2.5 to -0.9.
Item Specifications: Deductive reasoning problems presented in numeric content with a work-like
context.

Inductive Reasoning
Construct Definition: This subtest is a measure of fluid intelligence (Cattell, 1971) and is sometimes
referred to as “abstract reasoning” due to the nature of conceptual level reasoning as opposed to
reasoning in the context of learned content.
Level Distinctions: Inductive reasoning subtests constructed for Management and Graduate level jobs
should have an effective theta range from -1.8 to 0.0.
Item Specifications: Inductive reasoning problems requiring the ability to reason about relationships
between various concepts independent of acquired knowledge. Item content is abstract and does not
include any work-like context.

Mechanical Comprehension
Construct Definition: Relevant to many technical roles, the Mechanical Comprehension test is
designed to measure a candidate’s understanding of basic mechanical principles and their application
to devices such as pulleys, gears and levers.
Level Distinctions: Typically used for production, manufacturing or engineering-related recruitment or
development, the test is often deployed to assess school leaver suitability for modern apprenticeship
schemes, the practical application abilities of science and technology graduates, or those with work
experience looking to move to a technical role.
Item Specifications: Not Available.

Checking
Construct Definition: Designed to measure a candidate’s ability to compare information quickly and
accurately, the Checking test is particularly useful when assessing an individual’s potential in any role
where perceptual speed and high standards for maintaining quality are required.
Level Distinctions: The test is relevant to entry-level, administrative and clerical roles, as well as
apprenticeship schemes.
Item Specifications: Comparisons of target alphanumeric strings with other alphanumeric strings.

Calculation
Construct Definition: Designed to measure a candidate’s ability to add, subtract, divide and
manipulate numbers quickly and accurately, the Calculation test is particularly useful when assessing
an individual’s potential in any role where calculation and estimation, as well as auditing and checking
the numerical work of others, are required.
Level Distinctions: The test is relevant to entry-level, administrative and clerical roles, as well as
apprenticeship schemes.
Item Specifications: Item content includes arithmetic operations used in work-like calculations and
estimation.

Reading Comprehension
Construct Definition: The Verify Reading Comprehension test measures a candidate’s ability to read
and understand written materials and is useful for assessment at a range of job levels. This ability is
very important wherever candidates will be expected to read, understand and follow instructions, or
use written materials in the practical completion of their job.
Level Distinctions: An effort has been made to ensure the text passages and resulting questions are
relevant to as wide a range of industries and roles as possible. Note: No information is provided
about the theta ranges for different job levels.
Item Specifications: The task involves reading a passage of text and answering a short written
question. The content of the test makes no assumptions about prior knowledge, with applicants
directed to use only the information in the passage to derive their answer.

Spatial Ability
Construct Definition: Intended to measure the ability to rapidly perceive and manipulate stimuli, and to
accurately visualise how an object will look after it has been rotated in space. The test is designed to
provide an indication of how an individual will perform when asked to manipulate shapes, interpret
information and visualise solutions. The Spatial Ability test is completely non-verbal and features only
shapes and figures. This ability is commonly required when an individual is required to work with
complex machinery or graphical information.
Level Distinctions: The Verify Spatial Ability Test is an online screening assessment. It enables
organisations to recruit candidates applying to jobs at all levels that require spatial ability. Note: No
information is provided about the theta ranges for different job levels.
Item Specifications: Sample tasks for jobs that may require spatial ability include, but are not limited
to: rapidly perceiving and manipulating stimuli to accurately visualise how an object will look after it
has been rotated; correctly interpreting graphical information; visualising the interactions of various
parts of machines; and efficiently communicating visual information (e.g., using charts or graphs) in a
presentation.

Deductive Reasoning
Construct Definition: Intended to measure the ability to draw logical conclusions based on information
provided, identify strengths and weaknesses of arguments, and complete scenarios using incomplete
information. The test is designed to provide an indication of how an individual will perform when
asked to develop solutions when presented with information and draw sound conclusions from data.
Level Distinctions: This form of reasoning is commonly required to support work and decision-making
in many different types of jobs and at many levels. Note: No information is provided about the theta
ranges for different job levels.
Item Specifications: Sample tasks for jobs that may require deductive reasoning include, but are not
limited to: evaluating arguments; analysing scenarios; and drawing logical conclusions.

* The content of each Verify subtest may be tailored to three or more of nine different job levels. Each
subtest has versions for 3-6 different levels.

In SHL’s case, they report that the constructs selected for test development were informed by their
accumulated validity evidence showing the types of cognitive ability tests that had been predictive of
each of seven dimensions of performance across a wide range of job families. SHL refers to these
seven performance dimensions as the SHL Universal Competency Framework (UCF). The UCF
dimensions are
 Presenting and Communicating Information
 Writing and Reporting

 Applying Expertise and Technology
 Analyzing
 Learning and Researching
 Creating and Innovating
 Formulating Strategies and Concepts

Extensive SHL research, summarized by Bartram (2005), has identified these performance
dimensions and investigated the validity of a variety of types of predictors including cognitive abilities.
This research provided the foundation for SHL’s decisions to select the Verify cognitive constructs.
Unfortunately, no publicly available research database shows the matrix of meta-analyzed validities
between these UCF dimensions and various cognitive ability tests. Nevertheless, SHL’s strategy for
selecting constructs for Verify subtests is clear. They relied on the validities previous cognitive tests
demonstrated against a set of universal performance dimensions relevant across job families and job
levels. They did not rely directly and explicitly on a theoretical framework of broad and narrow
abilities.

The Level Distinctions column provides SHL’s description of the item difficulty ranges for each of the
reasoning subtests, for which this information was provided. As shown in Table 57, except for
Reading Comprehension, all Verify subtests are available in multiple levels of difficulty ranging from
six levels of difficulty for Verbal and Numerical Reasoning to three levels for Checking and
Calculation. Verify was initially developed with three subtests primarily to be appropriate for higher
level management and service jobs. The later expansion of Verify included subtests, such as
Checking and Calculation, that are more suited to lower level jobs.

Item Content

Sample items are shown below for the three reasoning subtests. Sample items were not available for
the remaining four subtests.

Verbal Reasoning

Verbal Reasoning content was developed with a clear work-like context, although this work-like
context, itself, is unlikely to change the construct being measured because each statement requires
deductive reasoning about presented information rather than judgment about how that information is
applied in a work context. However, this item content clearly represents an application of crystallized
ability rather than fluid ability, and this presumably is due to the verbal complexity of the item rather
than its work-like context.

“Many organisations find it beneficial to employ students over the summer. Permanent staff often wish
to take their own holidays over this period. Furthermore, it is not uncommon for companies to
experience peak workloads in the summer and so require extra staff. Summer employment also
attracts students who may return as well qualified recruits to an organisation when they have
completed their education. Ensuring that the students learn as much as possible about the
organisation encourages interest in working on a permanent basis. Organisations pay students on a
fixed rate without the usual entitlement to paid holidays or bonus schemes.”
Statement 1 - It is possible that permanent staffs who are on holiday can have their work carried out
by students. T-F-Cannot say
Statement 2 – Students in summer employment are given the same paid holiday benefit as
permanent staff. T-F-Cannot say
Statement 3 – Students are subject to the organisation’s standard disciplinary and grievance
procedures. T-F-Cannot say
Statement 4 – Some companies have more work to do in the summer when students are available for
vacation work. T-F-Cannot say

Numerical Reasoning

Very similar to Verbal Reasoning, the Numerical Reasoning content demonstrated in both sample
items was developed with a clear work-like context, although this work-like context, itself, is unlikely to
change the construct being measured because each statement requires deductive reasoning about
presented information rather than judgment about how that information is applied in a work context.
However, this item content clearly represents an application of crystallized ability rather than fluid
ability, and this is presumably due primarily to the number knowledge required, and perhaps to a
modest level of verbal complexity, rather than to its work-like context.

Sample 1

For each question below, click the appropriate button to select your answer. You will be told whether your answer
is correct or not.
Newspaper Readership

                      Readership (millions)    % of adults reading each paper in Year 3
Daily Newspapers      Year 1      Year 2       Males         Females
The Daily Chronicle   3.6         2.9          7             6
Daily News            13.8        9.3          24            18
The Tribune           1.1         1.4          4             3
The Herald            8.5         12.7        30            23
Daily Echo            4.8         4.9          10            12

Question 1 - Which newspaper was read by a higher percentage of females than males in Year 3?
The Daily Chronicle / The Tribune / The Herald / Daily News / Daily Echo

Question 2 – What was the combined readership of the Daily Chronicle, the Daily Echo and The Tribune in Year
1?
10.6 / 8.4 / 9.5 / 12.2 / 7.8

Sample 2

[Chart not reproduced: Amount Spent on Computer Imports, by country and year]

Question 3 – In Year 3, how much more than Italy did Germany spend on computer imports?

650 million 700 million 750 million 800 million 850 million

Question 4 – If the amount spent on computer imports into the UK in Year 5 was 20% lower than in Year 4, what
was spent in Year 5?

1,080 million 1,120 million 1,160 million 1,220 million 1,300 million

Inductive Reasoning

Because it was intended to measure fluid ability, Inductive Reasoning was developed to avoid
requiring acquired knowledge. This eliminated the opportunity to include work-like content.

In each example given below, you will find a logical sequence of five boxes. Your task is to decide which of the
boxes completes this sequence. To give your answer, select one of the boxes marked A to E. You will be told
whether your answer is correct or not.
Questions 1 to 4 each present such a sequence of boxes with answer options A to E. [Figural item
content not reproduced]

Mechanical Comprehension

If all items are similar to this sample item, Mechanical Comprehension is a fluid ability measure and
contains work-neutral content.

Question: As the wedge moves downwards, in which direction will the slider move?

A. Right
B. Left
C. The slider will not move

Checking

Checking is a processing speed test in which the score for each item is the speed with which the
candidate makes a choice. Items are presented on the screen one at a time following the response to
the previous item.

In this sample item, the candidate is asked to identify an identical string of letters or numbers from the list of
options on the right. Each item is individually timed.

WNPBFVZKW A. WPBNVFZKW
B. NWPBFVZWK
C. NWBPFVKZW
D. WPNFBVZKW
E. WNPBFVZKW

Calculation

Each item in this subtest requires the candidates to calculate the number that has been replaced by
the question mark. Each item is individually timed. The answer is provided on a separate answer
screen. Notably, the use of a calculator is allowed for this test.

The calculation is

? + 430 = 817

Reading Comprehension

Although SHL does not report reading levels for its Reading Comprehension items, this sample item
demonstrates a moderate level of complexity.

Passage:

“Biotechnology is commonly seen as ethically neutral. However, it is closely related to the conflicting
values of society. Genetically modified food has the potential to bear more resilient and nutritious
crops. Thus it may help the fight against world hunger. However, it also raises concerns about its
long-term effects and ethics. It is this controversy that has led to the rejection of genetically modified
food in Europe.”

Question: Where has genetically modified food been rejected?

A. Society
B. Europe
C. Agriculture
D. Nowhere

Sample items are not available for Spatial Ability or for Deductive Reasoning.

Construct Validity Evidence

As shown in Table 58, SHL reported two construct validity studies in which modest samples of college
students completed the three Verify reasoning subtests and either Ravens Advanced Progressive
Matrices or the GMA Abstract subtest. (Note, in these studies the three Verify reasoning subtest
were shorter versions of the operational subtests and appear to be defined as the verification subtests
are defined.)

Table 58. Observed (uncorrected) construct validity results for Verify subtests.

Observed correlations in samples of college students:

Subtest                 Ravens Matrices   GMA Abstract   Verify Verbal        Verify Numerical
                        (N=60)            (N=49)         Reasoning (N=109)    Reasoning (N=109)
Verbal Reasoning        .45               .39            --                   --
Numerical Reasoning     .40               .37            .25                  --
Inductive Reasoning     .56               .54            .39                  .32

At the time of these studies, Mechanical Comprehension, Checking, Calculation, Reading
Comprehension, Spatial Ability and Deductive Reasoning were not part of the Verify battery. As
expected, all Verify subtests correlated substantially with both Ravens and GMA Abstract, with
Inductive Reasoning correlating highest. This result is consistent with the abstract content feature
Inductive Reasoning shares with Ravens Matrices and GMA Abstract.

Item and Test Development

Developing Items and the Item Bank

Similar to all other batteries except Watson-Glaser, no available SHL document provides an explicit
description of the item writing procedures used to generate Verify items. It is clear, however, that all
Verify items were evaluated during the development process using a 2-parameter IRT model as the
basis for the decision to retain the items in the Verify item bank. The 2007 Verify Technical Manual
provides the following descriptions of the IRT-based process for evaluating and retaining items in the
Verify item bank.

“The fit of 1, 2 and 3-parameter models to SHL Verify ability items were tested early in the
SHL Verify programme with a sample of almost 9,000 candidates. As expected, the fit of a 1-
parameter model to items was poor, but the expected gain from moving from a 2- to a 3-
parameter model was not found to be substantial, and for the majority of items evaluated
(approaching 90%) no gain was found from moving to a 3-parameter model. Accordingly, a 2-
parameter model was selected and used for the calibration of verbal and numerical item
banks as generated for the SHL Verify Range.
The item development programme supporting the SHL Verify Range of Ability Tests extended
over 36 months during which items were trialed using a linked item design and with a total of
16,132 participants. Demographic details of the sample used to evaluate and calibrate SHL
Verify items are provided in the sections in this manual that describe the SHL Verify
comparison groups and the relationships between SHL Verify Ability Test scores and sex,
ethnicity and age.
Items were screened for acceptance into the item bank using the following procedure:

 A sensitivity review by an independent group of SHL consultants experienced in
equal opportunities was used to identify and screen out items that might be
inappropriate or give offence to a minority group. This was conducted prior to item
trials.
 Once trial data was obtained, a-parameters were reviewed with items exhibiting low
a-parameters being rejected.
 Review of b-parameters with items exhibiting extreme values (substantially less than -
3 or greater than +3) being rejected.
 Review of item response times (time to complete the item) with items exhibiting large
response times (e.g. 2 minutes) being rejected.
 Item distractors (alternate and incorrect answer options presented with the item) were
also reviewed with items being rejected where distractors correlated positively with
item-total scores (i.e. indicators of multiple correct answers to the item) or where the
responses across distractors were uneven (the latter analysis being conditional on
the difficulty of the item).
Items surviving the above procedure were subjected to a final review in terms of a and b-
parameters as well as content and context coverage (i.e. that the item bank gave a
reasonable coverage across different work settings and job types). This final review also
sought to provide a balance across the different response options for different item types.
That is, the spread of correct answers for verbal items avoided, say, the answer A dominating
over B and C correct answers across items in the SHL Verify item bank, and that the spread
of correct answers was approximately even for A, B, C, D and E options across numerical and
Inductive Reasoning items.” (Pg. 12)
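The statistical screening rules quoted above can be sketched as a simple filter over trialed items. This is our own illustration, not SHL's code: the b-parameter bounds (roughly -3 to +3) and the response-time example (2 minutes) come from the quoted text, while the a-parameter floor is an assumed value, since the manual does not state one.

```python
def screen_items(items, min_a=0.5, b_range=(-3.0, 3.0), max_rt=120.0):
    """Apply the quoted screening rules to trialed items.

    Each item is a dict with keys 'a', 'b', and 'median_rt' (seconds).
    min_a is an assumed discrimination floor; b_range and max_rt follow
    the thresholds mentioned in the quoted manual text.
    """
    kept, rejected = [], []
    for item in items:
        ok = (item["a"] >= min_a                          # reject low a-parameters
              and b_range[0] <= item["b"] <= b_range[1]   # reject extreme b-parameters
              and item["median_rt"] <= max_rt)            # reject very slow items
        (kept if ok else rejected).append(item)
    return kept, rejected

# Invented trial results for four items
trial = [
    {"a": 1.1, "b": 0.4, "median_rt": 45.0},    # acceptable
    {"a": 0.2, "b": 0.1, "median_rt": 40.0},    # low discrimination: rejected
    {"a": 0.9, "b": 3.6, "median_rt": 50.0},    # extreme difficulty: rejected
    {"a": 1.0, "b": -0.8, "median_rt": 150.0},  # excessive response time: rejected
]
kept, rejected = screen_items(trial)
print(len(kept), len(rejected))
```

The distractor-level checks described in the manual (distractor-total correlations, evenness of distractor endorsement) would require response-level data and are omitted from this sketch.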

Subtest Reliability

Once the Verify item banks were developed, at least to a sufficient extent, the reliability of randomized
forms was evaluated. Each randomized form of a subtest is developed by applying a set of rules to
the selection of the required number of items from the bank for that subtest. Unfortunately, SHL does
not provide even a high level description of the procedures for selecting items into a randomized form.
However, at least three considerations are presumed to apply.

 Some threshold level for standard errors of the theta or the test information function
associated with a set of items. That is, items in different forms must be selected that produce
similarly reliable estimates of theta.
 Items must be selected to represent the level of the randomized subtest, where level is defined
as a prescribed range of theta values. Higher level jobs are associated with higher ranges of
thetas.
 Items must be selected to be representative of the bank distribution of job-related item
characteristics such as work settings and job types.
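The presumed considerations above can be illustrated with a minimal form-assembly sketch. This is our own illustration under those presumptions, not SHL's algorithm: it restricts the draw to items in the level's difficulty range, randomizes the selection, and reports the form's information at a target ability so that insufficiently reliable forms could be regenerated. The bank, level range, and target theta are all invented.

```python
import math
import random

def information(theta, a, b):
    """Fisher information contributed by a 2PL item at ability theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def assemble_form(bank, n_items, level_range, target_theta, seed=None):
    """Illustrative randomized-form assembly under the presumed rules.

    bank: list of (a, b) item parameter pairs; level_range: (low, high)
    bounds on item difficulty b for the job level; target_theta: ability
    point at which forms should be similarly informative.
    """
    rng = random.Random(seed)
    # Restrict to items at the prescribed difficulty level
    eligible = [it for it in bank if level_range[0] <= it[1] <= level_range[1]]
    # Randomize the order, then take the required number of items
    rng.shuffle(eligible)
    form = eligible[:n_items]
    # Report the form's information at the target ability; a production
    # system would compare this against a threshold before accepting the form
    total_info = sum(information(target_theta, a, b) for a, b in form)
    return form, total_info

bank = [(1.0 + 0.05 * i, -2.0 + 0.1 * i) for i in range(40)]  # invented bank
form, info = assemble_form(bank, 10, (-2.0, -0.8), -1.5, seed=1)
print(len(form), round(info, 2))
```

A third rule, representativeness of job-related item characteristics, would require content tags on bank items and is omitted here.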

Estimating the traditional reliability of randomized-form tests, no one of which is likely to be identical
to any other, requires that many randomized forms be generated and the average internal
consistency measure of reliability (alpha) be computed across the many generated forms. SHL uses
procedures for estimating alpha from IRT results provided by du Toit (2003).
Table 59 shows the average alpha reliabilities derived for each of the three reasoning subtests in
large development samples. For the Verbal Reasoning and Numerical Reasoning subtests, reliability
was estimated separately for two different levels of subtests. This was done because the item banks
for different levels are somewhat different, although overlapping. Inductive Reasoning reliability was
estimated at only one level even though it is available at more than one level. Typically, Verify
subtest reliabilities are in the upper .70’s to mid-.80’s which is generally considered adequate for
selection tests. The Inductive Reasoning verification test reliability averaged .72, almost certainly
because it is much shorter (7 items) than the operational Inductive Reasoning subtest (24 items).

Table 59 also shows the magnitude of group differences for gender, race and age. In the section
above, Developing Items and the Item Bank, the brief description of some steps in the item
development process indicated that an independent group of SHL consultants screened items that
may be inappropriate or offensive to a minority group. But no mention is made of common empirical-
statistical tactics for evaluating evidence of differential item functioning. Because empirical methods
were not used, the possibility increases somewhat that the Verify subtests may be subject to group-
based bias. If group mean differences are unusually large for the Verify subtests, there could be an
increased concern about possible sources of bias. However, Table 59 shows that most group
differences, expressed as standardized mean differences (d values), are very small, and only two
values, both for Numerical Reasoning, are moderately small. Most notably, all White-Black group
differences are very small, which differs from the result typically found in US-based research on
White-Black group differences. The magnitude of the group differences reported in Table 59
suggests that the prospect of group bias is unlikely.

Table 59. Reliability and group difference estimates for Verify’s three original reasoning subtests.

Verbal Reasoning
  Overall: alpha = .80; Male-Female: N = 4,382 / 3,885, d* = .06; White-Non-White:
  N = 3,796 / 3,796, d* = .11; <40 - >40: N = 5,155 / 3,028, d* = .04
  Manager & Graduate: alpha = .81
  Supervisor & Operational: alpha = .78
  Verbal Verification: alpha = .77

Numerical Reasoning
  Overall: alpha = .84; Male-Female: N = 4,382 / 3,885, d* = .23; White-Non-White:
  N = 3,796 / 3,796, d* = .09; <40 - >40: N = 5,155 / 3,028, d* = .22
  Manager & Graduate: alpha = .83
  Supervisor & Operational: alpha = .84
  Numerical Verification: alpha = .79

Inductive Reasoning
  Overall: alpha = .77; Male-Female: N = 4,200 / 3,769, d* = .01; White-Non-White:
  N = 3,291 / 4,678, d* = -.08; <40 - >40: N = 7,228 / 614, d* = .14
  Inductive Reasoning Verification: alpha = .72

*For each pair of groups, d is defined as the difference between group means divided by the SD of
test scores, where the target group mean (i.e., Female, Non-White, >40) is subtracted from the
reference group mean (i.e., Male, White, <40)
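The d statistic defined in the footnote can be computed as below. The footnote says only "the SD of test scores," so the pooled within-group SD used here is an assumption, and the group means and SDs are invented to illustrate a d of about .23.

```python
import math

def standardized_d(mean_ref, sd_ref, n_ref, mean_tgt, sd_tgt, n_tgt):
    """Standardized mean difference: reference group minus target group.

    Assumes a pooled within-group SD; the footnote to Table 59 does not
    specify which SD was used.
    """
    pooled_var = (((n_ref - 1) * sd_ref ** 2 + (n_tgt - 1) * sd_tgt ** 2)
                  / (n_ref + n_tgt - 2))
    return (mean_ref - mean_tgt) / math.sqrt(pooled_var)

# Invented scores: a 2.3-point gap on a 10-point SD yields d of about .23,
# matching the Male-Female Numerical Reasoning value in Table 59
print(round(standardized_d(50.0, 10.0, 4382, 47.7, 10.0, 3885), 2))
```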

Criterion Validity Evidence

SHL presents criterion validity data only for the Verbal Reasoning and Numerical Reasoning subtests.
During the development of Verify, seven criterion validity studies were conducted. Table 60 describes
key characteristics of these seven studies, two of which did not include Verbal Reasoning.

Table 60. Characteristics of SHL validity studies for Verbal Reasoning and Numerical Reasoning.

Industry Sector        Job Level    Country    Criterion Measure                        Verbal N           Numerical N
Banking                Graduate     UK         Manager’s ratings of competency          Not Used in Study  102
Banking                Manager      Australia  Manager’s ratings of competency          221                220
Professional Services  Graduate     UK         Accountancy exam result                  Not Used in Study  111
Financial              Supervisor   UK         Manager’s ratings of competency          45                 45
Financial              Operational  UK         Supervisor’s ratings of competency       121                121
Retail                 Operational  US         Manager’s ratings of competency          89                 89
Education              Operational  Ireland    Performance on business education exams  72                 72
Total Sample                                                                            548                760

This information shows that two types of performance measures were used. In two studies, job-
related knowledge exams were used as criteria. The remaining five studies all used ratings of job-
related competencies provided by the supervisor / manager. No further information is provided.
However, it is likely that the competency ratings were based on SHL’s Universal Competency
Framework, which has a strong psychometric and substantive foundation.

The results of a meta-analysis applied to the validities from these seven studies are shown in Table
61. These results show that the weighted average validity, corrected for range restriction and
criterion unreliability, was .50 for Verbal Reasoning and .39 for Numerical Reasoning. The Verbal
Reasoning result is virtually identical to the overall average operational validity reported by Schmidt
and Hunter (1998) for general mental ability, which in most validity studies is a composite of 2-4
cognitive subtests. The .39 operational validity estimate for Numerical Reasoning does not indicate
that it is less valid than typical cognitive subtests; it is the composite of subtests representing general
mental ability that is estimated to have an operational validity of .50. These results show that all
variability in observed validity values is accounted for by sampling error.

Overall, these validity studies are in line with the broad, professionally accepted conclusion about the
level of predictive validity for measures of cognitive ability.

Table 61. Meta-analyses of validity studies of Verbal Reasoning and Numerical Reasoning.

Meta-analysis of Verify Ability Test Validities   Verbal Reasoning   Numerical Reasoning
Number of Studies (K)                             5                  7
Total Sample Size                                 548                760
Average Sample Size                               110                109
Range of Observed Validities                      0.21 to 0.43       0.11 to 0.34
Variance in Observed Validities (A)               0.01               0.00
Sampling Error Across Studies (B)                 0.01               0.01
True Variance in Validities (A-B)                 0.00               -0.01
Weighted Mean Operational Validity                0.50               0.39
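The quantities in Table 61 follow the bare-bones psychometric meta-analysis logic: a sample-size-weighted mean validity, the observed variance of validities (A), the variance expected from sampling error alone (B), and their difference (A-B). The sketch below is our own illustration of those calculations; the individual observed validities are invented (only their range, 0.11 to 0.34, is reported for Numerical Reasoning), and the corrections for range restriction and criterion unreliability that produce the .50 and .39 operational estimates are not included.

```python
def bare_bones_meta(validities, sample_sizes):
    """Bare-bones meta-analysis in the Hunter-Schmidt style (illustrative).

    Returns the weighted mean validity, observed variance (A), expected
    sampling-error variance (B), and the 'true' variance (A-B).
    """
    total_n = sum(sample_sizes)
    r_bar = sum(r * n for r, n in zip(validities, sample_sizes)) / total_n
    # Observed variance of validities, weighted by sample size (A)
    obs_var = sum(n * (r - r_bar) ** 2
                  for r, n in zip(validities, sample_sizes)) / total_n
    # Expected sampling-error variance at the average sample size (B)
    avg_n = total_n / len(sample_sizes)
    sampling_var = (1 - r_bar ** 2) ** 2 / (avg_n - 1)
    return r_bar, obs_var, sampling_var, obs_var - sampling_var

# Invented observed validities within the reported 0.11-0.34 range for
# Numerical Reasoning, paired with the seven reported sample sizes
rs = [0.30, 0.25, 0.11, 0.20, 0.34, 0.28, 0.22]
ns = [102, 220, 111, 45, 121, 89, 72]
r_bar, obs_var, sampling_var, true_var = bare_bones_meta(rs, ns)
print(round(r_bar, 2))
```

When the observed variance (A) does not exceed the sampling-error variance (B), as in Table 61, the conclusion is that sampling error accounts for all variability in the observed validities.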

Approach to Item and Test Security

Perhaps the most distinctive aspect of the Verify battery is the item and test security approach SHL
has implemented primarily to ensure that scores from this unproctored online assessment are credible
and adequately protected against various methods of cheating and piracy. At the same time, SHL
has been active professionally in describing its overall security strategy for Verify in presentations and
written material.

Even though MCS is very unlikely to implement unproctored test administration as part of its civil
service system, many aspects of Verify’s security strategy are likely to be applicable to MCS’s
proctored, computer administered testing process. Study 2 recommendations strongly encourage
MCS to launch its testing system with a very strong and prominent security strategy to build credibility
in the new process and to discourage attempts to cheat or defraud the system. Several of the
components of SHL’s Verify security strategy described below may be applicable to MCS’s civil
service testing system.

SHL describes seven major components of its overall security strategy for Verify.

1. Capitalize on technology
2. Develop a large item bank
3. Verify unproctored test results with proctored test results
4. Continually monitor sources of information for indicators of cheating
5. Support flexible testing options
6. Use scientific rigor
7. Communicate clearly with the test taker

Capitalize on Technology

Unproctored Verify relies heavily on randomized test forms for each individual and on response times
as indicators of potentially fraudulent attempts to complete tests. These strategies are enabled by
Flash player applets. There are several advantages of randomized forms, even with proctored
administration. Randomized forms are versions of whole subtests generated by an IRT-based
algorithm that are psychometrically equivalent but contain different items from one another. For each
test taker, one computer-generated version is randomly assigned. The advantages of randomized
forms include

 Minimize item exposure. Depending on bank size and other factors, each item is expected to appear on only a small percentage of all forms.
 Minimize answer key exposure. Because each randomized form contains different items in different sequences, no single answer key applies across forms.
 Discourage attempts to cheat or steal items, provided their use has been clearly communicated.
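One simple way to approximate randomized, roughly parallel forms is to stratify the bank by difficulty and draw one item at random from each difficulty band. This is only an illustrative sketch; SHL's actual IRT-based assembly algorithm is not published in detail, and the bank below is hypothetical.

```python
import random

def assemble_form(bank, n_items, seed=None):
    """Draw a randomized form from an item bank, stratified by difficulty
    so that every form covers a similar difficulty profile.
    `bank` is a list of (item_id, difficulty) pairs, where difficulty is
    an IRT b-parameter. Leftover items (if the bank size is not an exact
    multiple of n_items) are simply unused for this form."""
    rng = random.Random(seed)
    ordered = sorted(bank, key=lambda item: item[1])        # easy -> hard
    stratum = len(ordered) // n_items                       # items per difficulty band
    form = [rng.choice(ordered[i * stratum:(i + 1) * stratum]) for i in range(n_items)]
    rng.shuffle(form)                                       # randomize item sequence as well
    return form

# Hypothetical 100-item bank with difficulties spread from -5.0 to +4.9
bank = [(f"item{i:03d}", (i - 50) / 10) for i in range(100)]
form_a = assemble_form(bank, n_items=20, seed=1)
form_b = assemble_form(bank, n_items=20, seed=2)
```

Each generated form then spans the same difficulty bands but contains mostly different items in a different sequence, which is what minimizes item and answer-key exposure.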

Response time information can be used effectively in data forensics as one source of information
about possible cheating (e.g., patterns of very short response times) or possible efforts to steal item
information (e.g., patterns of very long response times).
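A minimal data-forensics screen of this kind might flag sessions by median item response time. The thresholds below are arbitrary placeholders for illustration, not published SHL values.

```python
from statistics import median

def flag_response_times(times, fast=3.0, slow=120.0):
    """Flag a test session whose median per-item response time (seconds)
    is suspiciously fast (possible pre-knowledge or cheating) or
    suspiciously slow (possible item harvesting). Thresholds are
    illustrative placeholders only."""
    m = median(times)
    if m < fast:
        return "review: very fast responding"
    if m > slow:
        return "review: very slow responding"
    return "ok"

flag_response_times([2.1, 1.8, 2.5, 1.9])   # -> "review: very fast responding"
```

In practice such a flag would only trigger a human review, not an automatic decision about the test taker.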

Develop a large item bank

A large item bank is important for a variety of reasons.

 Enables the generation of randomized forms
 Reduces the overall impact on the testing program of occasional compromised items
 Encourages a strategy of continual item replacement or temporary retirement
 Supports development of item psychometric properties under the conditions in which the
items are used (International Test Commission (ITC), 2006, Guideline 22.3)
Compare unproctored test results to proctored test results

ITC Guideline 45.3 establishes that applicants whose unproctored test scores result in them being
regarded as qualified should be required to complete a proctored version of the same test to check the
consistency of the unproctored and proctored scores. Verify explicitly implements this strategy with its
“Verification” forms of each of the Verify subtests. Applicants who are regarded as qualified to be
hired are required to complete a proctored verification subtest for each Verify subtest that contributed
to their “qualified” status. SHL has developed a sophisticated algorithm for estimating the likelihood
of the two sets of scores occurring by chance and has carried out extensive analyses of the
odds of detecting cheaters, as reported in the 2007 Verify Technical Manual.

Importantly, SHL’s practice is not to combine the initial unproctored test result with, or replace it by,
the verification test result. In certain ways, combining or replacing scores could reward the cheater to
the disadvantage of non-cheaters who achieve similar unproctored scores. SHL discourages this
practice by not reporting the verification score, reporting only whether the verification result confirms
or disconfirms the unproctored result.

Clearly, verification testing would not be relevant if all MCS civil service exams are proctored.
However, even in the case of proctored exams, if MCS undertakes other data forensic analyses to
identify possible cheaters, a possible administrative course of action would be to require the applicant
to complete a second administration of the test. In such a case, the verification approach used in
unproctored testing may be applicable.
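SHL's likelihood algorithm is not reproduced here, but the core idea of verification can be sketched as a test of whether the unproctored score exceeds the proctored score by more than chance would allow, given the test's standard error of measurement (SEM). The SEM value and the 1.96 cutoff below are hypothetical illustrations, not SHL parameters.

```python
def verification_flag(unproctored, proctored, sem, z_crit=1.96):
    """Compare an unproctored score with its proctored verification score.
    Assuming both scores share the same SEM, the standard error of their
    difference is sem * sqrt(2). A large positive z (unproctored score
    much higher than the verification score) is the pattern of interest.
    Illustrative sketch only; SHL's published algorithm is more elaborate."""
    z = (unproctored - proctored) / (sem * 2 ** 0.5)
    return "disconfirmed" if z > z_crit else "confirmed"

verification_flag(unproctored=28, proctored=18, sem=2.5)   # large drop -> "disconfirmed"
```

Note that only a drop from the unproctored to the proctored score is flagged; a proctored score at or above the unproctored score simply confirms the original result.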

Continually monitor sources of information for indicators of cheating and piracy

SHL has established a partnership with Caveon Test Security, an industry leader in evaluating
evidence of cheating and/or piracy, to monitor SHL’s Verify item database and routinely inspect
item-level responses and response times for indicators of cheating or other fraudulent practices. In
addition, SHL actively engages in web patrols that continuously search for web sites in which
information may be disclosed about cheating or piracy attempts against Verify or, for that matter,
other SHL test instruments. These web patrols can identify not only communications about cheating
and piracy efforts but can also build an understanding of the reputation of the target test.

While such strategies would have a high priority with unproctored testing, they may also provide
significant value with proctored testing, which MCS should assume will also be the target of
collaborative efforts to steal and sell information about the civil service exams.

Support flexible testing options

Given the cost and convenience advantages of unproctored online testing, SHL encourages the use
of Verify by making it available in a variety of ways, for a variety of jobs and embedded in a variety of
delivery systems. This feature of Verify has less to do with security than it has to do with marketing
the use of Verify.

Use scientific rigor

To establish credibility and confidence in the use of Verify especially given that it is administered in
unproctored settings, SHL has, in certain respects, demonstrated a high level of scientific rigor in its
development, use and maintenance. The development of large item banks, the development of IRT-based psychometric properties in large samples, the use of randomized forms, the investment in data forensic services, and SHL’s high visibility at professional conferences all contribute to Verify’s reputation for yielding credible scores.

Communicate a clear “Honesty” contract with test takers.

An important component of any large-scale testing program in which test takers have opportunities to
know and communicate with one another is that the test publisher not only clearly communicate about
security measures used to protect the test but also require that test takers agree to complete the test
as they are instructed to complete it. In the case of unproctored testing, ITC Guideline 45.3 states
that
“Test takers should be informed in advance of these procedures (security measures) and
asked to confirm that they will complete the tests according to instructions given (e.g., not
seek assistance, not collude with others etc.)
This agreement may be represented in the form of an explicit honesty policy which the test-
taker is required to accept.” (ITC, 2006, p. 164)

Study 2 Authors’ Comment Regarding This Description of SHL’s Approach to Unproctored Testing

Our view of SHL’s approach to the maintenance of item and test security is that it represents a
professional “best practice” in the sense that SHL adheres to virtually all professional guidance
relating to the use of online testing and, in particular, unproctored online testing. Indeed, SHL
principals have been in the forefront of communicating about the inevitability and the need for
professional standards guiding unproctored online testing. In short, if MCS were to decide that
unproctored administration was needed, SHL’s Verify is currently the best “role-model” for what a
comprehensive approach should be like.

But we do not encourage MCS to use unproctored administration for their civil service exams. First, at
the same time that ITC has established guidelines for unproctored testing, unproctored testing risks
violation of other existing professional guidance (See, Pearlman (2009)). Second, we believe that
MCS should launch and communicate about its civil service testing system in a manner that
deliberately and explicitly is designed to build its credibility and engender the confidence of all
applicants. As a government process, its civic reputation will be important. Unproctored
administration, especially at the outset, risks the credibility of the system especially in a cultural
setting where the use of tests does not have a long track record in personnel selection.

Translations / Adaptations

SHL has developed multiple language versions for all Verify subtests, shown in Table 62, with the
exception of Reading Comprehension, which is available only in International English, and of Spatial
Ability and Deductive Reasoning, for which language versions are not yet reported.

Table 62. Languages in which subtests are available.

Verbal Reasoning, Numerical Reasoning, Inductive Reasoning:
    Arabic, Brazilian Portuguese, Canadian French, Chinese (Complex), Chinese (Simplified),
    Danish, Dutch, Finnish, French, German, Hungarian, Indonesian, Italian, Japanese, Korean,
    Latin American Spanish, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai,
    Turkish, UK English, US English

Mechanical Comprehension:
    Canadian French, Chinese (Simplified), Dutch, French, German, International English,
    US English

Checking, Calculation:
    UK English, US English, French, Dutch, German, Swedish, Norwegian, Danish, Finnish,
    Italian, Indonesian

Reading Comprehension:
    International English

Spatial Ability, Deductive Reasoning:
    Not reported

For most subtests, Test Fact sheets are available in UK English and in French.

User Support Resources (Guides, Manuals, Reports, Services, Etc.)

In support of its employment selection tools, SHL provides extensive resources for registered users
and somewhat fewer resources for unregistered browsers. Since being acquired by CEB, SHL has
begun to position itself as a multiservice “People Intelligence” resource for organizations, providing
a range of services including consulting services, business management tools, assessment tools, HR
process tools, and training. SHL clearly provides access to more resources for both registered and
unregistered individuals than any other publisher of the reviewed batteries. Resources supporting
just employment related test products include

 Technical Manuals
 User Guides
 Information Overviews
 Product Description Materials
 Individual Test Fact Sheets
 Sample Report
 Practice Information (Although we are not certain whether this includes practice booklets or
just sample items.)
 Access to an SHL Library containing many, if not all, professional presentations and
publications by SHL professionals
 Business tools for talent management
 Technology platforms for test delivery and applicant management

Evaluative Reviews

Perhaps because Verify is a relatively new employment testing system that emphasizes its mode of
delivery rather than the uniqueness of the constructs measured by its subtests, we have found no
reviews of Verify in common professional resources such as Buros Mental Measurements Yearbook.
Further, we have found no published evaluation of Verify tests other than studies published by SHL
principals about research that preceded and informed the development of Verify. Perhaps the best
example of such a published study is Bartram (2005), which reported the extensive research effort at
SHL to identify common work performance “competencies”. These competencies, now represented
as SHL’s Universal Competency Framework (UCF), have become the primary framework by which
SHL describes the type of work for which individual subtests are considered appropriate. For
example, the Test Fact sheet for Mechanical Comprehension describes this subtest as relevant to
work roles depending on the following Universal Competency and its underlying behaviors:

4.2 – Applying Expertise and Technology

 Applies specialist and detailed technical expertise
 Develops job knowledge and expertise through continual professional development
 Shares expertise and knowledge with others
 Uses technology to achieve work objectives

Overall, our evaluation is that the SHL approach to the development, administration, maintenance
and use of selection tools, and its support of users/customers, is superior to that of any other
publisher of the reviewed batteries. Our recommendations about several facets of MCS’s civil service
system have been strongly influenced by SHL’s example of professional practice.

DESCRIPTIVE RESULTS FOR WATSON-GLASER
Overall Description and Uses

Introduction

In the 1920s, Goodwin Watson and Edward Glaser developed an assessment process for measuring
critical thinking. Their precursor tests evolved by 1960 into standardized versions that came to be
called the Watson-Glaser Critical Thinking Appraisal (W-GCTA). This test was designed to assess
five facets of critical thinking – Recognize Assumptions, Evaluate Arguments, Deduction, Inference,
and Interpretation. In addition, W-GCTA was designed to assess critical thinking in both neutral and
controversial contexts. Over the next 50 years the early Forms Ym and Zm (100 items) were
eventually replaced by somewhat shorter Forms A and B (80 items). A British version, Form C, was
developed as an adaptation of Form B, also with 80 items. Subsequently, a Short Form was
developed based on only 40 items while continuing to assess the same five facets. More recently, in
2010, two new shortened forms, D and E (both 40 items), were introduced to replace the previous
long Forms A, B, and C (UK) and the Short Form. Several interests motivated the development of
Forms D and E: (a) empirical factor analytic studies had concluded that W-GCTA scores were best
explained by three factors of critical thinking rather than the five traditional components of W-GCTA;
(b) a desire to improve the business relevance and global applicability of all items and the currency of
controversial issues; and (c) a desire to increase the range of scores while retaining current levels of
reliability at the shorter length of 40 items. At the same time, a much larger bank of items was
developed to support an online, unproctored version of W-GCTA in which each test taker would
receive a different randomized set of items. This online, unproctored administration is currently
available only in the UK.

The W-GCTA is constructed around testlets of 2-5 items each. Each testlet consists of a text
passage that provides the information the test taker analyzes to answer critical thinking questions.

This report will focus primarily on the new Forms D and E and will provide information about the very
recent online, unproctored version available in the UK. However, empirical data gathered over time
using the longer earlier Forms A and B and the Short Form will be cited where they provide
evidence for form equivalence and for construct and criterion validity.

Purpose

W-GCTA was developed to provide an assessment of critical thinking for use in selection decisions,
development/career planning, interview planning, and academic assessment. While the W-GCTA
has been applied in a wide variety of settings, its application in personnel selection has been
especially focused on higher level job families and roles such as manager / executive / professional,
consistent with the relatively high complexity of its item content. While the reading level of all
W-GCTA passages and items is intended to be no higher than the 9th grade level, the content of the
passages, especially, is intended to be representative of content that would be encountered in a
business context or would be found in newspaper media. As a result, the W-GCTA content
complexity level is intended to be comparable to the complexity level of common business scenarios.

In spite of its relatively high content complexity, the W-GCTA is easy to use and score: scoring
requires only that the number of correct answers be totaled for each part of the whole test.

Targeted Populations and Jobs

The 9th grade reading level and complexity of items result in W-GCTA being appropriate with more
educated populations such as candidates for managerial / executive / professional occupations.

To accommodate the use of W-GCTA across a wide range of occupations and levels, norms have
been developed for many occupations and job levels. In 2012, Pearson published Form D and E
norms for the following occupations:

 Accountant,
 Consultant,
 Engineer,
 Human Resource Professional,
 Information Technology Professional, and
 Sales Representative.

At the same time, Pearson also published Form D and E norms for the following job levels:

 Executive,
 Director,
 Manager,
 Supervisor,
 Professional / Individual Contributor, and
 Manager in Manufacturing / Production.

In the 2010 W-GCTA Technical Manual, Pearson reported that the W-GCTA “customer base” consisted
primarily of business professionals (approximately 90%) and college students (approximately 10%).

Spread of Uses

W-GCTA has been used in a wide range of settings and purposes ranging from personnel selection to
clinical mental assessment to academic evaluation. Within personnel selection, W-GCTA has a long
history of use with managerial / executive / professional decision making. Specific uses have
included personnel selection, development feedback, training achievement, and academic readiness
and achievement.

Online, Unproctored UK Version

All of the above information applies to the online, unproctored UK version.

Administrative Detail

Table 63 shows administrative details for the current and previous forms of W-GCTA.
Numbers and Types of Items

Current and previous forms include passages and items for each of the original 5 components of
critical thinking – Recognize Assumptions, Evaluate Arguments, Deduction, Inference and
Interpretation. However, the current versions report scores on only three scales – Recognize
Assumptions, Evaluate Arguments and Draw Conclusions, where Draw Conclusions is a composite of
the shorter versions of Deduction, Inference and Interpretation. Previous factor analytic studies
demonstrated that scores on Deduction, Inference and Interpretation were sufficiently homogeneous
to constitute a single factor, now called Draw Conclusions.

Time Limits

W-GCTA places less emphasis on time limits than other standardized cognitive ability batteries.
Users have the option of administering W-GCTA with or without time limits. Nevertheless, when time
limits are imposed, the same 40-minute limit is used for the shorter current forms as was used for the
longer previous forms. Pearson reported that in the standardization sample used in the
development of the current shorter forms the median time to completion was 22.48 minutes.
The UK versions of the current shorter forms have a somewhat shorter time limit of 30 minutes.

Table 63. Administrative detail about previous and current W-GCTA Forms

Watson-Glaser II (Current), Forms D and E

    Subscales and # Items:
        Recognize Assumptions – 12
        Evaluate Arguments – 12
        Draw Conclusions – 16 (5 Deduction, 5 Inference, 6 Interpretation)
    Time Limit: 40 minutes for the whole battery (30 minutes for the UK version) for
        paper-pencil administration; untimed administration is also available for computer
        or paper-pencil administration.
    Method of Delivery: Fixed tests; paper-pencil or computer administration, proctored.
        (An unproctored version is available in the UK.)
    Scoring Rule: Raw scores as number of items answered correctly. Percentile rank scores
        are provided based on relevant industry/occupation/job norms. Local norms may
        also be used.

Watson-Glaser (Previous), Forms A and B

    Subscales and # Items:
        Recognize Assumptions – 16
        Evaluate Arguments – 16
        Deduction – 16
        Inference – 16
        Interpretation – 16
    Time Limit: 40 minutes for the whole battery, or untimed.
    Method of Delivery: Fixed items, paper-pencil administration, proctored.
    Scoring Rule: Raw scores as number of items answered correctly. Percentile rank scores
        are provided based on relevant industry/occupation/job norms.

Types of Scores

Raw scores are computed as the number of items answered correctly. Although all items are
multiple choice, no adjustment is made for guessing. Raw scores are converted into percentile rank
scores and T-scores (Mean = 50, SD = 10) for reporting purposes. Percentile rank scores are
provided with respect to several norm populations, including the British general population, several
occupation groups, and several role/level groups.
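The raw-to-normed conversions described above can be sketched as follows. The norm sample here is hypothetical, not a published Pearson norm group.

```python
from statistics import mean, stdev

def to_t_score(raw, norm_scores):
    """Convert a raw score to a T-score (mean 50, SD 10) relative to a
    norm group: T = 50 + 10 * z, where z standardizes the raw score
    against the norm group's mean and SD."""
    z = (raw - mean(norm_scores)) / stdev(norm_scores)
    return 50 + 10 * z

def percentile_rank(raw, norm_scores):
    """Percentage of the norm group scoring at or below the raw score."""
    return 100 * sum(s <= raw for s in norm_scores) / len(norm_scores)

norms = [18, 20, 22, 24, 25, 26, 28, 30, 31, 33]  # hypothetical norm sample of raw scores
to_t_score(28, norms)        # above the norm mean, so T > 50
percentile_rank(28, norms)   # -> 70.0
```

In operational use the norm group would be one of the published occupation or job-level samples, and the percentile would be read from the publisher's norm tables rather than computed from raw norm data.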

Method of Delivery

Two methods of delivery are available in the US with the current Forms D and E: paper-pencil and
online. In the US both methods of delivery are required to be proctored and consist of the same item
sets as the paper-pencil Forms D and E. As in the US, in the UK Form D may be administered in a
proctored online mode.

However, in the UK a version of W-GCTA is available that may be delivered online without proctoring.
This “version” of W-GCTA does not correspond to any particular fixed form of W-GCTA because
unproctored administration requires that a different randomized “form” of a 40-item W-GCTA be
administered to each test taker. This process of creating randomized forms requires a large number
of items in an item bank that may be repeatedly sampled in a prescribed randomized manner such
that virtually every test taker is administered a unique, equivalent randomized form. This process of
item development, bank construction and test creation will be described in detail below. As of early
2013, this development process had produced an item bank of 376 quality-tested items associated
with 100 passages.

Cost

UK prices for the specific resources necessary to administer, score and receive reports are shown
below. Pricing was not available for online administration.

    Item                  Unit Price
    Test Booklet          £22.00
    25 Answer Forms       £85.00
    Scoring Key           £47.50
    10 Practice Tests     £30.50
    Individual Reports    £21.00

Construct – Content Information

Intended Constructs

The W-GCTA was developed to measure critical thinking as Watson and Glaser defined it. They
defined critical thinking as the ability to identify and analyze problems as well as seek and evaluate
relevant information in order to reach an appropriate conclusion. This definition assumes three key
aspects of critical thinking:

1. Attitudes of inquiry that involve an ability to recognize the existence of problems and an
acceptance of the general need for evidence in support of what is asserted to be true;

2. Knowledge of the nature of valid inferences, abstractions, and generalizations in which the
weight or accuracy of different kinds of evidence are logically determined; and

3. Skills in employing and applying the above attitudes and knowledge.

Current W-GCTA Forms D and E organize the assessment of critical thinking around three primary
components, Recognize Assumptions, Evaluate Arguments and Draw Conclusions. Recent factor
analyses of W-GCTA scores (Forms A, B, and Short Form) have shown that previous components
Recognize Assumptions and Evaluate Arguments are each single factors and that Draw Conclusions
is a relatively homogeneous factor consisting of three previously identified components, Deduction,
Inference and Interpretation. Current W-GCTA Forms D and E continue to include passages and
items associated with each of the original five components, although in somewhat different
proportions, but report scores only on the three factors of Recognize Assumptions, Evaluate
Arguments, and Draw Conclusions.

Recognition of Assumptions is regarded as the ability to recognize assumptions in presentations,
strategies, plans and ideas.

Evaluate Arguments is regarded as the ability to evaluate assertions that are intended to persuade.
It includes the ability to overcome a confirmation bias.

Draw Conclusions is regarded as the ability to arrive at conclusions that logically follow from the
available evidence. It includes evaluating all relevant information before drawing a conclusion,
judging the plausibility of different conclusions, and selecting the most appropriate conclusions, while
avoiding overgeneralizing beyond the evidence. Although Draw Conclusions has been shown
empirically to be a relatively homogeneous factor, operationally the W-GCTA defines it as the
composites of scores on the facets of Deduction, Inference and Interpretation.

Watson and Glaser added a further consideration to the assessment of critical thinking. In their view
critical thinking involved the ability to overcome biases attributable to strong affect or prejudice. As a
result, they built into the W-GCTA measurement process several passages and items intended to
invoke strong feeling or prejudices. This measurement approach continues into the current forms of
W-GCTA.

Item Content

Each item is a question based on a reading passage that supports 2-6 items. These clusters of a
passage and 2 or more items are referred to as testlets. (Note: we have not located any reference to
investigations of the extent to which this constructed interdependency of items within testlets has
influenced W-GCTA measurement properties. As reported above, W-GCTA scores are number-right
scores without any adaptation/weighting for this built-in item interdependence. Nevertheless, it can
be assumed that this interdependence among items within testlets reduces the effective length of the
W-GCTA assessment. This may be a possible explanation for the relatively low reliabilities reported
for the Evaluate Arguments scale, in which the distinction between neutral and controversial passages
is most pronounced.)
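The effective-length point in the note above can be quantified with the Spearman-Brown prophecy formula, which projects how reliability changes when test length changes by a given factor. The 0.80 starting reliability is a hypothetical value for illustration, not a reported W-GCTA statistic.

```python
def spearman_brown(r_orig, length_factor):
    """Spearman-Brown prophecy: projected reliability when test length is
    multiplied by `length_factor`. A length_factor below 1 illustrates how
    a reduced *effective* length (e.g., from testlet interdependence)
    would be expected to lower reliability."""
    return length_factor * r_orig / (1 + (length_factor - 1) * r_orig)

spearman_brown(0.80, 0.5)   # halving effective length drops reliability from .80 to ~.67
```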

General Item Content Specifications

Certain characteristics are prescribed for all testlets, regardless of the target component. It appears
that these characteristics have not changed over the decades of W-GCTA evolution. The passages
are required to include problems, statements, arguments and interpretations of data similar to those
encountered on a daily basis in work contexts, academic contexts, and in newspaper/magazine
content. Certainly, these passages will be laden with local culture and social context and should not
be assumed to be equivalent across cultures. As an example, the first British version of W-GCTA,
Form C, is a British adaptation of the US-based Form B.

In addition to this prescription about testlet content, testlets must also be written to be either neutral or
“controversial”. Neutral content does not evoke strong feelings or prejudices. Examples of neutral
content (typically) include weather, scientific facts and common business problems. “Controversial”
content is intended to invoke affective, emotional responses. Domains from which controversial
content may be developed include political, economic and social issues. (Note: we have found no
explicit reference to any step in W-GCTA testlet development designed to evaluate the affective
valence of draft content.) Watson and Glaser concluded that the Evaluate Arguments component
was most susceptible to controversial content. As a result, few controversial testlets are included in
Recognize Assumptions or in any part of Draw Conclusions; more controversial testlets are included
in Evaluate Arguments. We have found no information reporting the specific numbers or proportions
of controversial testlets targeted for each test component.

All testlets are developed to have a reading level no higher than 9th grade (US) and are heavily loaded
with verbal content. All content has a strong social/cultural context that in some testlets is work-like.
But there is no overall prescription that a high percentage of the testlets should contain work-like
content. The recent emphasis has been to develop testlet content that is more applicable across
international boundaries.

Specific Item Content Examples


This section shows example testlets taken from W-GCTA user manuals. All example testlets were
taken from US-oriented sample materials.

Recognize Assumptions
Statement:

We need to save time in getting there so we’d better go by plane.

Answer Options: Assumptions Made? YES NO

Proposed assumptions:

1. Going by plane will take less time than going by some other means of transportation. (YES, it is
assumed in the statement that the greater speed of a plane over the speeds of other means of
transportation will enable the group to reach its destination in less time.)

2. There is a plane service available to us for at least part of the distance to the destination. (YES,
this is necessarily assumed in the statement as, in order to save time by plane, it must be possible to
go by plane.)

3. Travel by plane is more convenient than travel by train. (NO, this assumption is not made in the
statement – the statement has to do with saving time, and says nothing about convenience or about
any other specific mode of travel.)

Evaluate Arguments
Statement:

Should all young people in the United Kingdom go on to higher education?

Answer Options: Strong Weak

Proposed Arguments:

1. Yes; college provides an opportunity for them to wear college scarves. (WEAK, this would be a
silly reason for spending years in college.)

2. No; a large percentage of young people do not have enough ability or interest to derive any
benefit from college training. (STRONG. If it is true, as the directions require us to assume, it is a
weighty argument against all young people going to college.)

3. No; excessive studying permanently warps an individual’s personality. (WEAK, this argument,
although of great general importance when accepted as true, is not directly related to the question,
because attendance at college does not necessarily require excessive studying.)

Draw Conclusions - Inference


Statement:

Two hundred school students in their early teens voluntarily attended a recent weekend student
conference in Leeds. At this conference, the topics of race relations and means of achieving lasting
world peace were discussed, since these were problems that the students selected as being most
vital in today’s world.

Answer Options: True (T), Probably True (PT), Insufficient Data (ID), Probably False (PF), and False
(F)

Proposed Inferences:

1. As a group, the students who attended this conference showed a keener interest in broad social
problems than do most other people in their early teens. (PT, because, as is common knowledge,
most people in their early teens do not show so much serious concern with broad social problems. It
cannot be considered definitely true from the facts given because these facts do not tell how much
concern other young teenagers may have. It is also possible that some of the students volunteered to
attend mainly because they wanted a weekend outing.)

2. The majority of the students had not previously discussed the conference topics in the schools.
(PF, because the students’ growing awareness of these topics probably stemmed at least in part from
discussions with teachers and classmates.)

3. The students came from all parts of the country. (ID, because there is no evidence for this
inference.)

4. The students discussed mainly industrial relations problems. (F, because it is given in the
statement of facts that the topics of race relations and means of achieving world peace were the
problems chosen for discussion.)

5. Some teenage students felt it worthwhile to discuss problems of race relations and ways of
achieving world peace. (T, because this inference follows from the given facts; therefore it is true.)

Draw Conclusions – Deduction


Statement:

Some holidays are rainy. All rainy days are boring. Therefore:

Answer Options: Conclusion follows? YES NO

Proposed Conclusions:

1. No clear days are boring. (NO, the conclusion does not follow. You cannot tell from the
statements whether or not clear days are boring. Some may be.)

2. Some holidays are boring. (YES, the conclusion necessarily follows from the statements as,
according to them, the rainy holidays must be boring.)

3. Some holidays are not boring. (NO, the conclusion does not follow, even though you may know
that some holidays are very pleasant.)

Draw Conclusions – Interpretation


Statement:

A study of vocabulary growth in children from eight months to six years old shows that the size of
spoken vocabulary increases from 0 words at age eight months to 2,562 words at age six years.

Answer Options: Conclusion follows? YES NO

Proposed Conclusions:

1. None of the children in this study had learned to talk by the age of six months. (YES, the
conclusion follows beyond a reasonable doubt since, according to the statement, the size of the
spoken vocabulary at eight months was 0 words.)

2. Vocabulary growth is slowest during the period when children are learning to walk. (NO, the
conclusion does not follow as there is no information given that relates growth of vocabulary to
walking.)

Construct Validity Evidence

The ability constructs underlying the W-GCTA have remained unchanged since its inception but do
not correspond precisely to the cognitive ability constructs assessed by other commonly used
cognitive tests. Nevertheless, it is to be expected that the W-GCTA measure of critical thinking would
correlate substantially with other measures of reasoning or intelligence or general mental ability.
Table 64 presents empirical construct validity correlations for various forms of W-GCTA gathered
since 1994. The results reported in Table 64 are taken from recent W-GCTA technical manuals and
do not capture all studies of W-GCTA construct validity. Also, it should be noted that, with the likely
exception of the WAIS-IV Processing Speed Index, all other comparison tests are sources of
convergent validity evidence.

The results indicate that W-GCTA total scores correlate substantially with other convergent measures.
Based on these results, a couple of points are worth noting. First, the Short Form shows observed
convergent validities similar in magnitude to validities for the then-current longer Forms A and B.
Second, in the small 2010 study comparing W-GCTA to WAIS-IV scores, there appears to be
noticeable variation in W-GCTA scale validities, even though total score validity is similar to that in
other studies with other tests. In particular, Draw Conclusions correlates substantially higher than
either Evaluates Arguments or Recognize Assumptions with all WAIS-IV indices. This is likely the
result of the greater heterogeneity of the testlets in Draw Conclusions compared to the other two
scales. Also, it is worth noting that none of the W-GCTA scales yielded a strong correlation with the
WAIS-IV Processing Speed index. This might be considered a supportive divergent validity result
given that the W-GCTA is not intended to assess the speed of critical thinking and is administered
either with a generous time limit or no time limit.

Table 64. Observed construct validity correlations with forms of Watson-Glaser Critical Thinking Appraisal
(Except for the Pearson 2010 WAIS-IV study, correlations are with the W-GCTA Total score.)

Study (Sample Size)                           Comparison Test                                  r
Education majors (Taube, 1995; N = 147-194)   SAT Verbal                                      .43
                                              SAT Math                                        .39
                                              Ennis-Weir Critical Thinking                    .37
Nursing students (Adams, 1999; N = 203)       ACT Composite                                   .53
Job incumbents (Pearson, 2006; N = 41)        Ravens APM                                      .53
Job incumbents (Pearson, 2006; N = 452)       Rust Advanced Numerical Reasoning Appraisal     .68
Job incumbents (Pearson, 2005; N = 63)        Miller Analogies for Professional Selection     .70
Rail dispatchers (W-G, 1994; N = 180)         Industrial Reading Test                         .53
                                              Test of Learning Ability                        .50
Lower managers (W-G, 1994)                    Wesman Verbal (N = 219)                         .51
                                              EAS Verbal Comprehension (N = 217)              .54
                                              EAS Verbal Reasoning (N = 217)                  .48
Mid managers (W-G, 1994; N = 208-209)         Wesman Verbal                                   .66
                                              EAS Verbal Comprehension                        .50
                                              EAS Verbal Reasoning                            .51
Exec managers (W-G, 1994)                     Wesman Verbal (N = 440)                         .54
                                              EAS Verbal Comprehension (N = 437)              .42
                                              EAS Verbal Reasoning (N = 436)                  .47
Teaching applicants (Pearson, 2009)           SAT Verbal (N = 556)                            .51
                                              SAT Math (N = 558)                              .50
                                              SAT Writing (N = 251)                           .48
                                              ACT Composite (N = 254)                         .59

WAIS study (Pearson, 2010; N = 62)            Recognize   Evaluate   Draw   Total
WAIS-IV Full Scale IQ                             .31        .21      .62     .52
WAIS-IV Perceptual Reasoning                      .20        .25      .56
WAIS-IV Working Memory                            .24        .13      .59
WAIS-IV Verbal Comprehension                      .34        .10      .46
WAIS-IV Processing Speed (p > .05)                .09       -.01      .22
WAIS-IV Fluid Reasoning                           .32        .36      .67     .60

To date, no construct validity studies have been reported for the UK online, unproctored version of
the W-GCTA.

Testlet / Test Development

The testlet and test development procedures described below are those used recently to develop the
new, shorter Forms D and E, as well as the larger bank of items required to support the online,
unproctored mode of delivery. This development process was organized into six stages of work.
For the first three stages, Conceptual Development, Item Writing/Review, and Piloting, the procedures
were essentially the same for developing the new fixed Forms D and E as for developing the large
item bank for the unproctored online test. The later stages used different procedures for fixed form
development than for item bank development. These differences are noted below.

Testlet Development Procedures: Fixed Forms D and E and the Item Bank for Online
Unproctored Administration

Conceptual Development Stage

Exploratory factor analyses of Form A, B and Short Form data were undertaken separately for each
Form to evaluate the factor structure underlying the 5-component framework for Forms A, B and the
Short Form. Initial results identified three factors that were stable across Forms and two additional
factors that could not be identified but were associated with psychometrically weaker testlets, and
were not stable across Forms. Subsequent factor analyses that excluded the weaker testlets and
specified three factors revealed the three factors Recognize Assumptions, Evaluate Arguments and
Draw Conclusions, which included the previous Deduction, Inference and Interpretation components.
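The exploratory step described above can be illustrated with a small simulation: when a few stable factors underlie the testlet scores, the leading eigenvalues of the inter-testlet correlation matrix stand well above the rest. This is an illustrative sketch with simulated data, not the analysis Pearson performed.

```python
# Illustrative factor-count check: eigenvalues of the correlation matrix
# of simulated testlet scores driven by 3 latent abilities (hypothetical data).
import numpy as np

rng = np.random.default_rng(2)
# 300 simulated test takers, 3 latent abilities, 3 testlet scores per ability
factors = rng.normal(size=(300, 3))
loadings = np.kron(np.eye(3), np.ones((1, 3)))   # each ability loads 3 testlets
scores = factors @ loadings + 0.8 * rng.normal(size=(300, 9))

corr = np.corrcoef(scores, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
print(np.round(eigvals, 2))  # three large eigenvalues, then a sharp drop
```

In real data the drop is rarely this clean; unstable, uninterpretable factors like the two Pearson discarded typically show up as small eigenvalues tied to weak testlets.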

In addition to the factor analyses of Form A, B and Short Form data, Pearson consulted with W-GCTA
users, including HR professionals internal to client organizations, external HR consultants and
educational instructors.

These two sources of input combined to establish the objectives for the process of developing new
testlets to support new fixed Forms and the development of a larger bank of items to support online,
unproctored testing. As a result of this conceptual development stage, the following development
objectives were established:

 Ensure the new questions were equivalent to the existing items


 Ensure that content is relevant for global markets
 Ensure that content is appropriate for all demographic groups
 Ensure that the item-banked test continues to reliably measure the same critical thinking
constructs as other versions of the Watson-Glaser.
 Ensure that a sufficient number of new items were developed to support two new 40-item
forms (each with approximately 35% new items and 65% existing items) and the item bank
required to support online, unproctored delivery

Item Writing and Review Stage

Item writing was conducted by individuals with extensive prior experience writing critical thinking /
reasoning items. These included experienced item writers, occupational psychologists or
psychometricians with at least 15 years of experience. Detailed guidelines and specifications for
writing items were provided to each writer, and each writer submitted items for review prior to
receiving approval to write additional items. Writers were instructed to write items at a 9th grade
reading level, using the identical testlet format as used in previous forms and measuring the same
constructs as measured in the previous forms.

Subject matter experts with experience writing and reviewing general mental ability/reasoning items
reviewed and provided feedback on how well each new draft item measured the target construct,
clarity and conciseness of wording, and difficulty level.

In addition, a separate group of subject matter experts reviewed and provided feedback on how well
draft items and items from existing forms could be expected to transport or be adapted to other
countries/cultures. This included a consideration of topics or language which would be unequally
familiar to different groups.

As a final step, experimental scenarios and items intended to be business relevant were reviewed for
use of appropriate business language and situations by Pearson’s U.S. Director of Talent
Assessment.

For fixed form tests, Forms D and E, 200 new items were produced through this item writing process.
These items were based on approximately 40 new testlets, each with approximately 5 items.

For the item-banked online, unproctored test, 349 items were produced. (This development activity
for the item-banked test took place in 2012. Additional development is currently underway in a 2nd
phase.)

Pilot Stage

The pilot stage consisted of the administration of new items to W-GCTA test takers by adding some
number of new items to each form of the existing W-GCTA test being administered. These new
items were unscored. These pilot administrations were intended to provide empirical data for
psychometric evaluations of the new items.

Fixed Forms

Each administration of an existing form (Forms A, B, and Short) included a set of 5-15 of the 200 new
items. Each set was administered until at least 100 cases had been collected. When an adequate
number of cases had been accumulated for each set, it was replaced by another set of 5-15 new
items. This process continued until all new items had been administered to an adequate number of
test takers. The average number of cases across all 200 new items was 203. Classical Test Theory
(CTT) analyses were used to estimate the psychometric properties of each new item.

Item-Banked Test

Because a larger number of items was needed to generate the item bank for online unproctored
testing, up to 40 new items were added to existing operational paper-pencil forms (Forms A, B, Short,
and the UK paper-pencil version of B) as well as to the UK online, proctored version, which is a
fixed-form, computer-administered test.

Psychometric Analyses of Initial Pilot Data

For both development efforts, CTT psychometric analyses were conducted for each new item. Item
difficulty, discrimination and distractor analyses were used to identify poor performing items, which
were then deleted from the development pools.
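The CTT screening referred to above can be sketched for the first two statistics: item difficulty as the proportion correct, and discrimination as the corrected item-total correlation (a distractor analysis would additionally tabulate option choices by ability group). The data are hypothetical; this is not Pearson's code.

```python
# Minimal CTT item-analysis sketch: difficulty (p-value) and corrected
# item-total discrimination for 0/1 scored pilot data (hypothetical).
import numpy as np

def ctt_item_stats(scored):
    """scored: 2-D array (test takers x items) of 0/1 item scores."""
    scored = np.asarray(scored, dtype=float)
    total = scored.sum(axis=1)
    stats = []
    for j in range(scored.shape[1]):
        p = scored[:, j].mean()                       # difficulty: proportion correct
        rest = total - scored[:, j]                   # total score excluding item j
        r_it = np.corrcoef(scored[:, j], rest)[0, 1]  # corrected discrimination
        stats.append({"item": j, "difficulty": p, "discrimination": r_it})
    return stats

# Hypothetical pilot data: 6 test takers, 3 items
data = [[1, 1, 0],
        [1, 0, 0],
        [1, 1, 1],
        [0, 0, 0],
        [1, 1, 1],
        [0, 1, 0]]
stats = ctt_item_stats(data)
for s in stats:
    print(s)
```

Items with very extreme difficulty or near-zero (or negative) discrimination would be the ones "deleted from the development pools."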

Calibration Stage

The primary purpose of the Calibration stage was to place retained new items and existing items on
common ability scales using IRT analyses. The first question was whether the full pool of items was
unidimensional enough to support IRT analyses on the whole pool or whether IRT analyses should be
carried out separately for the Recognize Assumption (R) items, the Evaluate Arguments (E) items,
and the Draw Conclusions (D) items. Factor analyses indicated a large first factor, R, and a smaller
second factor, a combination of E and D.

BILOG-MG was used to estimate IRT item parameters, both for the whole pool and for the R pool and
E+D pool separately, using both 2- and 3-parameter IRT models. The criteria for selecting between
IRT models, in spite of small sample sizes of 350 cases for many items, included:

 Goodness of fit indices for items and the model as a whole;
 Parameter errors of estimate; and
 Test-retest reliability for modeled parallel forms based on different parameterizations.

Using these criteria, a 3-parameter model with a fixed guessing parameter showed the best fit. (This
result was influenced by the fact that the majority of items have only two options, yielding a 50%
chance of guessing correctly; sample size limitations also prevented estimation of the full 3-parameter
model.) In addition, test-retest reliability for modeled parallel forms favored estimates based on the
whole item pool over estimates based on the R and E+D pools separately. Reliabilities produced from
IRT parameter estimates based on all three subtests combined into a single whole pool ranged from
.80 to .88. Reliabilities produced from IRT parameter estimates based on the R and E+D pools
separately were lower, ranging from .71 to .82. The reliability advantage of the whole-pool estimates
over the separate-pool estimates was the primary factor leading Pearson to conclude that the
unidimensionality assumption was supported sufficiently; goodness of fit indices did not provide
strong support for that assumption.

As a result of these analyses, the final IRT model chosen was the 3-parameter model with a fixed
guessing parameter, treating the whole item pool as unidimensional.
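The chosen model can be written as the standard 3PL item characteristic curve with the guessing parameter c fixed rather than estimated. A minimal sketch with hypothetical parameter values:

```python
# 3PL item characteristic curve with a fixed guessing parameter c.
# For the two-option Watson-Glaser items, c would be fixed near .50;
# the a and b values below are hypothetical, not calibrated parameters.
import math

def p_correct(theta, a, b, c):
    """P(correct | ability theta) under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# At theta = b the examinee is halfway between guessing and certainty:
print(p_correct(theta=0.0, a=1.2, b=0.0, c=0.50))  # -> 0.75
```

Fixing c removes one parameter per item from estimation, which matters when, as noted above, many items had only about 350 calibration cases.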

Tryout Stage: Form D

The primary purpose of the tryout stage, which was applied only to the new Form D, was to confirm
the three factors, Recognize Assumptions, Evaluate Arguments and Draw Conclusions and to
demonstrate subscale reliability. Based on a sample of 306 candidates with at least a Bachelor’s
degree, factor analysis results supported the three factor structure and internal consistency estimates
provided evidence of adequate reliability, although in the case of Evaluate Arguments the estimated
internal consistency was marginally adequate. Reliability results are reported below.

Standardization Stage: Forms D and E

Standardization analyses addressed the relationships between the new Forms D and E and the
previous Forms A, B and Short and other measures of closely related constructs. These latter
investigations of construct validity were reported above in the Construct Evidence section in Table 64.
Alternate forms reliabilities are shown below in the Reliability section. Overall, alternate forms
reliabilities between Form D and the Short Form at the Total score and subscale score levels were
strong, all above .80, with the exception of Evaluates Arguments, where the correlation between Form
D and the Short Form was only .38, indicating a lack of equivalence.

Form E was constructed from new items following the demonstration that Form D showed adequate
factor structure and internal consistency. Form E items were selected based on item selection
criteria, information from the pilot and calibration stages, and item difficulty levels corresponding with
the difficulty of Form D items.

Equivalence of New Forms and Previous Forms

Following the development of Forms D and E, a development study with 636 test takers was
conducted to compare the means and standard deviations of these two alternate forms and to
assess their correlation. Table 65 shows the results of this equivalency study. Prior to this study,
other equivalency studies were conducted for various pairs of forms. These are also reported in
Table 65.

Table 65. Equivalency results for various pairs of Watson-Glaser test forms

Study (Sample Size)                          Test Score       Form 1        Form 2        Observed
                                                              Mean / SD     Mean / SD     Alt. Forms r
Form D Standardization Study (N = 636)       Form 1 = Form D; Form 2 = Short Form
                                             Total (40)       27.1 / 6.5    29.6 / 5.7      .85
                                             Recognize (12)    8.2 / 3.1     5.8 / 2.2      .88
                                             Evaluate (12)     7.7 / 2.2     6.9 / 1.3      .38
                                             Draw (16)        11.2 / 3.1    17.9(1) / 4.0   .82
Forms D and E Development Study (N = 209)    Form 1 = Form D; Form 2 = Form E
                                             Total (40)       22.3 / 6.3    22.9 / 5.6      .82
Paper v. Computer Administration (N = 226)   Form 1 = Short Form (Paper); Form 2 = Short Form (Computer)
                                             Total (40)       29.8 / 5.6    29.7 / 5.6      .87
US 12th Grade Students (N = 288)             Form 1 = Form A; Form 2 = Form B
                                             Total (80)       46.8 / 9.8    46.6 / 9.3      .75
UK "Sixth Form" Students (N = 53)            Form 1 = Form C; Form 2 = Form B (UK adaptation of Form B)
                                             Total (80)       56.8 / 8.3    57.4 / 9.5      .71

(1) The Draw Conclusions mean reported for the Short Form, 17.9, is almost certainly a typo because
the Draw Conclusions scale includes only 16 scored items. However, we have not been able to
confirm the correct value.

A few points are worth noting about the equivalency data reported in Table 65. First, the Total Score
alternate forms reliabilities for all 40-item tests are substantial, all .82 or greater. However, the Total
Score alternate forms reliabilities for the 80-item Forms A, B and C, at .71 and .75, are lower than
those for the shorter, newer forms. Given that the construct and testlet content specifications were
virtually identical for all forms, this is a remarkable result. One possible explanation is that the
equivalency studies involving Forms A, B and C were conducted recently, after the item content had
become outdated, to the extent that students may have responded less reliably to the older testlet
content.

Second, the computerized version of the Short Form correlated .87 with its paper-pencil counterpart
and produced nearly identical means and SDs. This result created great confidence on the
publisher’s part that online administration alone would not likely distort scores.

Third, scale scores for Evaluate Arguments correlated only .38 between Form D and Short Form.
This is a low alternate forms reliability for two scales defined and constructed in virtually identical
ways. Other forms of reliability evidence reported below strengthen the overall conclusion that
Evaluates Arguments has demonstrated marginal levels of reliability. The 2010 Watson-Glaser
Technical Manual comments on this result by noting “Because Evaluates Arguments was the
psychometrically weakest subscale in previous forms of the Watson-Glaser, the low correlation was
not surprising.” Surprising or not, this result strongly suggests Evaluate Arguments may not
contribute to the overall usefulness of Watson-Glaser scores. Construct validity results reported
above in Table 64 and criterion validity results reported below in Table 69, support this expectation
that Evaluates Arguments contributes little, if anything, to the meaningfulness and usefulness of W-
GCTA scores.

Item Bank Configuration for Online, Unproctored Testlets

Once new items developed for the large item bank and existing items were calibrated on the same
scale, the process of configuring the item bank was carried out. Including all new testlets and
existing testlets that survived the preceding stages of development and review, 100 passages
remained, supporting 376 specific items.

The primary objective guiding the configuration of the item bank was to ensure that all test takers
receive equivalent tests. All administered tests must have the same number of scored items and
must be a representative sample of the diverse content areas within each subscale, including some
but not all business-related passages. Also, all administered tests should avoid extreme differences
in overall test difficulty. However, given that online, unproctored tests are scored based on IRT
estimates of overall ability level, and not on number-right scores, overall test difficulty, as measured
by the number of incorrect answers, is not a critical consideration.

Unfortunately, Pearson has not provided a precise description of the manner in which items are
banked or selected into tests. From the information provided, it is plausible that the item bank has a
moderately complex taxonomic structure in which one dimension is type of scale (5) and a second
dimension is type of topic (e.g., business, academic, news, etc.). Presumably there are formalized
constraints on item sampling that restrict the number of easy and more difficult questions and, most
likely, restrict the tendency to over-sample the highest quality items.

Pearson estimates that based just on the Phase 1 bank size of approximately 376 items (100 testlets)
over 1 trillion tests are possible. Analyses of overlap between pairs of simulated randomized 40-item
tests showed that the average amount of overlap was approximately 1 item.
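The configuration rules described above amount to a constrained random assembly of each 40-item test. The sketch below illustrates the idea under stated assumptions: the 12/12/16 quota per scale is documented, but the difficulty cut-offs, field names and bank composition are hypothetical, and Pearson's actual algorithm has not been published.

```python
# Hypothetical constrained random test assembly from an item bank.
# Quotas (12 Recognize, 12 Evaluate, 16 Draw) follow the manual;
# the +/- 2.0 difficulty window and bank layout are assumptions.
import random

def assemble_test(bank, rng):
    """bank: list of dicts with 'id', 'scale' and 'difficulty' (IRT b) keys."""
    quotas = {"Recognize": 12, "Evaluate": 12, "Draw": 16}
    test = []
    for scale, n in quotas.items():
        pool = [it for it in bank
                if it["scale"] == scale and -2.0 < it["difficulty"] < 2.0]
        test.extend(rng.sample(pool, n))  # random draw within constraints
    return test

rng = random.Random(0)
# Hypothetical 376-item bank (the Phase 1 size) with random difficulties
scales = ["Recognize"] * 120 + ["Evaluate"] * 120 + ["Draw"] * 136
bank = [{"id": i, "scale": s, "difficulty": rng.uniform(-3, 3)}
        for i, s in enumerate(scales)]

t1, t2 = assemble_test(bank, rng), assemble_test(bank, rng)
overlap = {it["id"] for it in t1} & {it["id"] for it in t2}
print(len(t1), len(overlap))  # 40 items per test; overlap between tests is small
```

Simulating many such pairs and averaging the overlap sizes is the kind of analysis that would produce Pearson's "approximately 1 item" overlap estimate.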

Equivalency of Randomized Test Forms

An important consideration is the equivalency of randomized tests generated from the same item
bank. A study was conducted to assess the means, standard deviations and correlations between
pairs of randomized tests administered to the same test takers. In this study, each test taker
completed two 40-item randomized tests. In one condition, the pair of tests developed for a test
taker was constructed to be fully equivalent, including similar levels of difficulty. In a second
condition, the pair was constructed to be equivalent except with respect to difficulty: one test was
constructed to be easier than the test construction algorithm would normally allow and the other,
more difficult than would be allowed. The reason for this second condition was to evaluate the extent
to which IRT calibration would control for variation in item difficulty, as is expected. This study was
conducted in four different samples, in which two samples received equivalent pairs of tests and two
samples received non-equivalent pairs of tests. Table 66 shows the equivalency results in these four
samples.

Table 66. Equivalency results for pairs of unproctored randomized online tests, each with 40 items.

                                      Sample    Test 1         Test 2         Observed
Sample                                Size      Mean / SD      Mean / SD      Alt. Forms Reliability
Equivalent Pairs - Sample A           335       -.17 / .80     -.12 / .82     .82
Equivalent Pairs - Sample B           318       -.06 / .89     -.07 / .92     .88
Non-Equivalent Pairs - Sample C       308       -.17 / .89     -.21 / .90     .78
Non-Equivalent Pairs - Sample D       282       -.24 / .91     -.24 / .89     .80

Table 66 shows that equivalent tests had alternate forms reliabilities of .82 and .88 and non-
equivalent (in terms of test difficulty) tests had reliabilities of .78 and .80. Also, within each sample
the pairs of tests had very similar means and SDs, regardless of the difficulty-related equivalency of
the pairs. These results are strong evidence that even large differences in randomized test difficulty
are unlikely to distort estimated ability levels to any substantial degree. This empirical result, that
ability estimates do not substantially vary as a function of item difficulty, provides some support for
the IRT invariance assumption, although Pearson does not report other evaluations of this
assumption. Nevertheless, the fact that tests of unequal difficulty produced ability estimates
correlating approximately .80 with one another indicates that the invariance assumption is sufficiently
supported for selection purposes.
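The robustness of IRT scoring to test difficulty can be illustrated with a small example. Under a simple 1PL model (chosen here for clarity; the operational model is a 3PL with fixed guessing), an examinee's EAP ability estimate from an easy item set and from a hard item set land close together even though the number-right scores differ. All parameter values and responses are hypothetical.

```python
# Why IRT scores are robust to difficulty: ability is estimated from item
# parameters, not number-right. EAP estimate on a grid under a 1PL model
# with a N(0,1) prior (illustrative sketch only).
import math

def eap_theta(responses, difficulties):
    """EAP ability estimate for 0/1 responses to items with given b values."""
    grid = [-4 + 0.1 * i for i in range(81)]
    post = []
    for t in grid:
        like = math.exp(-t * t / 2)  # unnormalized N(0,1) prior
        for u, b in zip(responses, difficulties):
            p = 1 / (1 + math.exp(-(t - b)))
            like *= p if u == 1 else 1 - p
        post.append(like)
    z = sum(post)
    return sum(t * w for t, w in zip(grid, post)) / z

# Same hypothetical examinee: 4/5 right on an easy set, 2/5 on a hard set
easy_b = [-1.5, -1.0, -0.5, -1.2, -0.8]
hard_b = [0.8, 1.2, 1.5, 1.0, 0.6]
theta_easy = eap_theta([1, 1, 1, 1, 0], easy_b)
theta_hard = eap_theta([1, 0, 0, 1, 0], hard_b)
print(round(theta_easy, 2), round(theta_hard, 2))  # similar theta estimates
```

The number-right scores (80% vs. 40%) differ sharply, yet the ability estimates are close, which is the mechanism behind the stable means in Table 66 across non-equivalent test pairs.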

Reliability and Inter-Scale Correlations

While equivalency studies produced alternate forms reliability estimates, reported above, other
studies of W-GCTA reliability for more recent forms have been conducted that estimate test-retest and
internal consistency reliability. Table 67 shows the results of these other recent reliability studies.

Table 67. Observed test-retest and internal consistency reliability estimates.
Test-Retest Estimates Internal Consistency Estimates
Study N
Total Recog. Eval Draw Total Recog. Eval Draw
Form B (80 items) 96 .73
UK Version of Form B (80
714 .92 .83 .75 .86
items)
UK Version of Form B (80
182 .93
items)
UK Version of Form B (80
355 .84
items)
Short Form 1994 (40 items) 42 .81
Short Form 2006 (40 items) 57 .89
Form D (40 items) 1,011 .83 .80 .57 .70
UK Version of Form D (40
169 .75 .66 .43 .60
items)
UK Version of Form D (40
1,546 .81
items)
Form E (40 items) 1043 .81 .83 .55 .65
UK Unproctored Online (40
2,446 .90 .81 .66 .81
items)
UK Unproctored Online (40
355 .82
items)
UK Unproctored Online (40
318 .84
items)
UK Unproctored Online (40
318 .86
items)

A number of points can be made about the reliability estimates reported in Table 67. First, the
original Form B shows the lowest test-retest reliability while the UK adaptation of Form B shows the
highest of the internal consistency reliabilities. It appears the adaptation of testlet content to the
British context improved reliability. Second, the Evaluates Arguments subscale is always the least
internally consistent of the subscales. The distinctive feature of Evaluates Arguments, that it includes
a much larger proportion of controversial passages, invites speculation that this feature harms its
psychometric properties; it continues the pattern of this subscale demonstrating the poorest
psychometric properties. Third, in the shorter forms, Recognize Assumptions always shows the
highest internal consistency while having fewer items (12) than the Draw Conclusions subscale (16).
Fourth, the internal consistency estimates for the randomized forms constructed for unproctored
online administration demonstrate similarly strong reliability compared to the fixed forms.
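The internal consistency estimates in Table 67 are coefficient alpha values. For reference, this is a minimal computation of alpha on simulated 0/1 item scores; the data are hypothetical, not the study samples.

```python
# Coefficient alpha (internal consistency) on simulated dichotomous items.
import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array (test takers x items) of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(1)
ability = rng.normal(size=500)
# 12 hypothetical items whose pass probability rises with ability
p_pass = 1 / (1 + np.exp(-ability[:, None]))
items = (rng.random((500, 12)) < p_pass).astype(int)
alpha = cronbach_alpha(items)
print(round(alpha, 2))
```

Because alpha rises with the number of items, the 80-item UK Form B can reach .92 while a psychometrically weaker 12-item subscale such as Evaluates Arguments sits in the .43-.66 range.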

Some evidence is available about the correlations among the 3 subscales in the more recent shorter
forms. Table 68 shows inter-subscale correlations in five separate study samples. Perhaps the
most notable pattern of results is that Evaluate Arguments correlates the least with other subscales in
the US-oriented Form D and Short Form while, in contrast, it correlates highest with Draw
Conclusions among all inter-subscale correlations for subscales that have been developed / adapted
for use in the UK. This is unlikely to be due to differences in recency between the US- and UK-
oriented Forms, since all versions have been developed within the past several years, with the
possible exception of the UK adaptation of the older Form B; it is not clear when that development
took place. Also, the results for the UK adaptation of Form B are not directly comparable to the other
inter-subscale correlations because the Draw Conclusions subscale in the UK Form B adaptation is a
non-operational composite of three subscales (Deduction, Inference and Interpretation), making it
60% of the total raw score, whereas it is only 40% of the total raw score in the shorter tests.

Table 68. Observed correlations among the three subscales of the newer shorter forms.
Subscales
Study N Subscales
Total Recognize Evaluate Draw
Total (40)
Recognize (12) .76
UK Adaptation of Form D 169
Evaluate (12) .73 .32
Draw (16) .67 .28 .40

Total (80)
Recognize (16) .74
UK Adaptation of From B 714
Evaluate (16) .78 .40
Draw (48) .95 .58 .66

Total (40)
Recognize (12) .86
UK Unproctored Randomized Tests 2,446
Evaluate (12) .73 .47
Draw (16) .86 .58 .63

Total (40)
Recognize (12) .79
US Form D 636
Evaluate (12) .66 .26
Draw (16) .84 .47 .41

Total (40) .85 .68 .43 .80


Comparison of Form D to Short Form Recognize (12) .74 .88 .26 .48
 Row subscales are Short Form 636
 Column subscales are Form D Evaluate (12) .41 .24 .38 .36
Draw (16) .75 .50 .37 .82
The correlations on the diagonal of the final panel (shown inside ovals in the original table) are the alternate forms reliabilities reported in Table 65.

Criterion Validity

Table 69 presents criterion validity results for several recent studies in which some Form of W-GCTA
was administered to applicants or employees or students for whom some work or academic
performance measure was also available. A number of key points can be noted based on this
compilation of observed criterion validities. First, W-GCTA demonstrates substantial positive validity
with respect to a wide range of performance measures. Second, little criterion validity evidence has
been reported recently for the three subscales, but the evidence that has been reported shows that
Draw Conclusions has higher criterion validity than both Recognize Assumptions and Evaluate
Arguments. This raises the prospect, especially for Evaluates Arguments, that the subscale
contributes little to the usefulness of the W-GCTA in personnel selection applications. Along with this
evidence of relatively low validity, the somewhat lower reliability reported above in Table 65 (alternate
forms) and Table 67 (internal consistency) draws attention to the possibility that Evaluates Arguments
is only marginally adequate psychometrically. Third, the single criterion validity study of the
unproctored online form strengthens the empirical foundation for the psychometric properties of
unproctored ability testing.

Overall, there is strong evidence that W-GCTA Total score predicts work and academic performance.

Table 69. Observed criterion validities for W-GCTA forms.
UK Form E/D
Form Unproc
Study N Criterion
A/B tored Total
Trial Recognize Evaluate Draw
Score
Assessment Center Assessor rating of Analysis .58
Ratings – Retail (Kurdish 71
& Hoffman 2002) Assessor rating of Judgment .43
41 Semester GPA .59
Freshman nursing classes
(Behrens 1996)
31 Semester GPA .53
37 Semester GPA .51
Education majors
(Gadzella et al 2002
114 GPA .41
Education majors (Taube 147-
1995) 194
GPA .30
Educational Psych Course grades .42
students (Gadzella et al 139
2004) GPA .28
Job incumbents (Pearson
2006)
2,303 Organization level achieved .33
Total AC performance (sum of all 19
Assessment Center dimensions)
.28
?
(Spector et al 2000)
Overall Performance (single rating) .24
Job incumbents (Pearson Supervisor rating of Problem Solving .33
142
2006) Supervisor rating of Decision Making .23
Supervisor rating of Problem Solving .40
Supervisor rating of Decision Making .40
Government analysts Supervisor rating of Professional/Technical
(Ejiogu, et al., 2006)
64
Knowledge & Expertise
.37
Total Score .39
Overall Potential .25
Educational Psych Mid-term exam score .42
428
students (Williams 2003) Final exam score .57
UK Bar Professional
Training Course (2010)
123 Final Exam Grade .62
UK Bar Professional
Training Course (2011)
988 Final Exam Grade .51
Teaching applicants
(Watson-Glaser)
1043 GPA for last 3 years of undergraduate study .18 .16 .24 .26
Supervisor rating of Core Critical Thinking .23 .11 .33 .30
Supervisor rating of Creativity .17 .10 .31 .26
Insurance managers
(Watson-Glaser)
65 Supervisor rating of Job Knowledge .10 .24 .21 .24
Supervisor rating of Overall Performance .02 -.03 .26 .12
Supervisor rating of Overall Potential .09 .14 .37 .27

Approach to Item / Test Security

Historically, the Watson-Glaser publishers have relied entirely on proctored administration, and the
appropriate training of those proctors, to ensure the security of items and whole tests. There is no
indication that testlets / items were frequently, or at all, replaced with fresh items. Until the 2012
Pearson UK Manual for the Watson-Glaser, there has been no indication in any sources we have
seen that banks of items were developed to support the earlier Forms A, B, C, D, E or the Short Form
beyond the items that were initially placed on those fixed forms.

Beginning around 2010, Pearson launched the development of an item bank to support the
requirements of unproctored online administration of a new “form” of the W-GCTA. The primary
purpose of this strategic shift to online administration, with or without proctoring, has been to
accommodate changing expectations, communication capabilities, and client requirements, all of
which have increased the demand for online assessment and the expectation that such online
capability not be constrained to environments suited to proctoring.

Pearson’s approach to unproctored online testing is similar in many ways to SHL’s approach with
Verify, although there are differences. The major elements of Pearson’s approach are described
here:

1. Randomized forms of tests. Each test taker will be administered a test of 40 items and,
presumably, a variable number of passages required to support those 40 items. The 40
items selected for each test taker are taken from a finite bank of items (currently
approximately 400 items). Pearson has estimated that, given the current bank size, any two
randomized tests will overlap by approximately 1 item, on average.

2. Test Construction to Ensure Equivalence. The 40 items are selected by an algorithm
from the bank of items to satisfy a number of requirements and constraints to ensure
equivalence. These include:
a. Each test will contain 12 Recognize items, 12 Evaluates items and 16 Draw items.
b. Items will be sampled from the bank taxonomy to ensure that at least some
(unknown) number of items are based on business content.
c. Items of high or low difficulty will be avoided to better ensure tests will have similar
levels of overall difficulty. (However, early studies have shown the IRT ability
estimation to be very robust against large differences in test difficulty.)
d. (Presumably) sampling rules will protect against over-sampling the better quality
items.

3. Bank Construction Practices. Banks of items are currently being developed in at least 2
phases. Phase 1 has produced 376 items. The item bank for unproctored online
administration includes new items as well as existing items that are included on current fixed
forms of W-GCTA tests. Using a 3-parameter model, with a fixed guessing parameter, all
items are analyzed in the same unidimensional pool. (IRT estimates are not developed
within Recognize, Evaluate or Draw content categories.) Newly developed items must first
satisfy CTT psychometric requirements for item quality including difficulty, discrimination, and
distractor characteristics before they are available to be trialed for IRT estimation purposes.

No information has been available about the nature of the bank taxonomy that governs the
process of randomized sampling of items. We have inferred from the 2012 UK Technical
Manual that there are at least two dimensions to the item taxonomy. One dimension is
certainly the 5 components of Watson-Glaser’s model of critical thinking, Recognize
Assumptions, Evaluate Arguments, Inference, Deduction and Interpretation, the last three of
which comprise Draw Conclusions. A second dimension may well be a categorization of item
content into subcategories such as business content, school content, news/media content,
and so on. Presumably sampling rules govern the number of items to be randomly selected
from within each subcategory of the item taxonomy. But we have not been able to confirm
the details of the item taxonomy and the sampling rules.

4. Proctored Verification Testing. Pearson recommends that clients verify satisfactory
unproctored test results by retesting candidates prior to making job offers. The verification
test is proctored and its results are compared to the previous unproctored result by Pearson
algorithms to evaluate the likelihood of cheating and make recommendations to the client
about managing the discrepancy.

5. Bank Maintenance. No information is yet available that describes the technical details of
Pearson’s approach to bank maintenance. Given the enhanced prospect for cheating in
unproctored environments, it is likely that Pearson uses some form of ongoing item review to
inspect item characteristics for evidence of drift attributable to cheating.
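Elements 1 through 3 above amount to a constrained sampling routine: draw items by category quota, within a difficulty window, from a calibrated bank. The sketch below is our illustrative reconstruction, not Pearson's actual algorithm; the difficulty window, field names, and toy bank layout are all assumptions.

```python
import random

# Category quotas from the description above; the difficulty window is an
# assumption (Pearson's actual rule for avoiding extreme items is not published).
QUOTAS = {"Recognize": 12, "Evaluate": 12, "Draw": 16}
B_WINDOW = (-1.5, 1.5)

def assemble_form(bank, seed=None):
    """Sample a 40-item form satisfying category quotas and a difficulty window."""
    rng = random.Random(seed)
    form = []
    for category, n in QUOTAS.items():
        eligible = [item for item in bank
                    if item["category"] == category
                    and B_WINDOW[0] <= item["b"] <= B_WINDOW[1]]
        if len(eligible) < n:
            raise ValueError(f"bank too thin for category {category}")
        form.extend(rng.sample(eligible, n))
    return form

# Toy bank of 400 items spread across the three scored categories.
rng0 = random.Random(0)
bank = [{"id": i,
         "category": ("Recognize", "Evaluate", "Draw")[i % 3],
         "b": rng0.uniform(-2.5, 2.5)}
        for i in range(400)]

form = assemble_form(bank, seed=1)
```

In practice, additional constraints (passage grouping, content subcategories, exposure control for the better items) would be layered onto the same filter-then-sample structure.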

Translation / Adaptation

W-GCTA tests are available in the following 12 languages and contexts.

 UK English
 US English
 Australian English
 Indian English
 Chinese
 Dutch
 German
 French
 Japanese
 Korean
 Spanish
 Swedish

The Pearson expert interviewed for this project, Dr. John Trent, had no information about Pearson’s
process for creating different language versions. Given the high dependency on social/cultural
content relating to business, school and media contexts and the need to ensure that passages are
either neutral or controversial, Pearson’s item development process relies heavily on the review of
draft items by professionals with considerable experience with the Watson-Glaser test instrument.
Certainly, equal expertise and experience are required for successful translations. Experienced
translation organizations such as Language Testing International are likely to be capable resources.

User Support Resources (E.g., Guides, Manuals, Reports, Services)

Pearson (Talent Lens) provides a wide range of user support resources for the Watson-Glaser
battery. (We note here that the assessment options available in the US and UK are not the same.
The unproctored online version of Watson-Glaser is currently available only in the UK. As reported
by Dr. John Trent, Pearson is currently working on making the UK assessment capabilities available
in the US.)

The Watson-Glaser resources available through the Talent Lens distribution channel include

• Technical Manuals
• Sample Reports (Interview, Profile, Development)
• Frequently Asked Questions documents
• User Guides
• Norm documentation
• Case Study descriptions
• Validity Study descriptions
• Information for Test Takers
• Sample Items
• White Papers
• Regular Newsletters (must be registered)

Overall, Pearson / Talent Lens provide strong and comprehensive support information to users. We
also found that it was relatively easy to obtain access to Pearson psychologists for more detailed
information about Watson-Glaser.

Evaluative Reviews

Many reviews of previous Watson-Glaser forms have been conducted. We focus here on the four
most recent reviews of 80-item Forms A and B (Berger, 1985; Helmstadter, 1985) and the 40-
item Short Form, S (Geisinger, 1998; Ivens, 1998). To our knowledge, no evaluative reviews have
been published about the most recent shorter versions, Forms D and E or their UK counterparts.

Certain common themes have emerged from these reviews. Reviewers generally view the Watson-
Glaser definition of critical thinking as a strength, although some have criticized the lack of construct
validity evidence to further develop and refine the specific meaning of the measure. Also, the large
amount of empirical criterion validity evidence that has been reported in earlier manuals for Forms A
and B especially has been praised. Geisinger, in particular, noted that most of these studies correlate
Watson-Glaser scores with academic outcomes, rather than organization or work performance.
Broadly, reviewers agree that Watson-Glaser predicts important outcomes in both academic and work
settings.

The criticisms of Watson-Glaser focus on lower than desired reliability and on the usefulness of
controversial topics. Both Berger (1985) and Helmstadter (1985) describe the Form A and B
reliability levels as “adequate – but not outstanding”. Geisinger (1998), in an especially insightful
review, criticizes the original Short Form, S, for insufficient score discrimination in the most popular
range of scores. This problem results from a combination of marginally adequate reliability in the
Short Form and significant range restriction in observed scores. He notes that 95% of the Short
Form scores fall within a 15-point range, 26-40, within the full 0-40 range of possible scores. As a
result, a test taker with a 60th percentile score of 34 would have a 95% confidence interval from 29.7 to
38.3, which ranges from the 19th percentile to the 99th percentile. Consequently, reliance on percentile scores could
lead to unreliable decisions. To Watson-Glaser’s credit, they appear to have given great weight to
this criticism and established an objective for the development of Forms D and E that the range of
observed scores should be increased.
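Geisinger's interval can be reproduced from the standard error of measurement, SEM = SD * sqrt(1 - reliability). The standard deviation and reliability used below are illustrative values we chose to be consistent with the interval he reports; they are not taken from the test manual.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def ci95(score, sd, reliability):
    """95% confidence interval around an observed score."""
    half = 1.96 * sem(sd, reliability)
    return (score - half, score + half)

# Illustrative values (assumed, not from the manual): SD = 5, reliability = .81.
lo, hi = ci95(34, sd=5, reliability=0.81)
print(round(lo, 1), round(hi, 1))  # roughly the 29.7 to 38.3 interval Geisinger reports
```

The point of the exercise is that with marginal reliability and a compressed score range, an apparently precise score covers a very wide percentile band.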

Our own evaluation of this issue of moderate reliability is that the scores for the US versions of the
new 40-item Forms D and E appear to have similar reliability to the previous forms but that the UK
versions appear to have somewhat higher reliabilities. In either case, the reliability levels for Total
scores are adequate for use in selection processes, but the reliability levels in some studies are not
adequate to rely on subscores for selection purposes. Further, the pattern of relatively low
reliabilities and validities for the Evaluate Arguments subscale, which is heavily loaded with
controversial content, raises a question about the meaning and measurement properties of the so-
called controversial content. Berger (1985) shares the same question about the meaningfulness of
the controversial content. In an interview about Watson-Glaser, Pearson’s Dr. John Trent
expressed the view that this subscale relies not only on cognitive components but on personality
components. (This may help to explain why some “convergent” validity results reported in Watson-
Glaser technical manuals describe correlations with personality assessments.)

Loo and Thorpe (1999) reported an independent factor analytic study of the Short Form and reached
a very similar conclusion that (a) internal consistency reliabilities were “low” for the five subscales, and
(b) principal components analyses did not support the five subtest structure. It also appears Watson-
Glaser sought to address this issue of factor structure by revising the subscale structure from 5 to 3 in
the development of Forms D and E. We would note, however, that at the item level no changes were
made to the intended content. Controversial items were retained and items capturing the five original
facets were retained. The measurement modification in Forms D and E was that the three facets of
Interpretation, Inference and Deduction are combined into a single scored scale, Draw Conclusions.

Overall, our evaluative assessment of Watson-Glaser forms is that they are valid predictors of
education and work outcomes, but their construct meaning is not yet clear and the inclusion of
controversial content may not improve the measurement or prediction properties of the battery.

SECTION 6:
INTEGRATION, INTERPRETATION AND EVALUATION OF RESULTS

This section follows the presentation of extensive information about each target battery by integrating
that information into comparative assessments of battery qualities, interpreting the potential
implications for MCS’s civil service test system and evaluating the features of batteries as possible
models for MCS civil service tests. These three primary purposes are served by organizing this
section into eight subsections, each of which addresses a particular aspect of the cognitive batteries
themselves or their use that is likely to be an important consideration for MCS’s planning,
development and management of its civil service test system. These eight subsections are:

 The question of validity
 Test construct considerations
 Item content considerations
 Approaches to item / test security and methods of delivery
 Item and test development
 Strategies for item management / banking
 Use of test results to make hiring decisions
 Considerations of popularity, professional standards and relevance to MCS interests

It is important to note that none of the commercial batteries reviewed in Study 2 is being used for civil
service hiring purposes. None of the publishers’ sources or other sources describes the manner in
which the target batteries would be best applied for purposes of civil service selection. The MCS civil
service testing system will have requirements and conditions that none of the target batteries has
accommodated. As a result, it is not our intention to identify a “most suitable” model from among the
target batteries. Rather, we will use information about key features of the batteries to describe the
ways in which those features may have value (or not have value) for MCS’s civil service system.

The Question of Validity


The battery-specific summaries report a wide variation in amount of available construct and criterion
validity information. At the low end, PPM has a modest amount of construct evidence with respect to
other reference subtests but no criterion validity evidence. (This remarkable feature of PPM has
been confirmed in discussion with Hogrefe’s PPM expert.) Our own effort to locate published
criterion validity studies for PPM has found none. At the high end, GATB is perhaps the most
validated battery of all cognitive tests used for selection, with the possible exception of the US Armed
Services Vocational Aptitude Battery (ASVAB). EAS also has an extensive record of criterion validity
evidence. Similarly, Watson-Glaser and DAT for PCA have extensive records of construct and
criterion validity evidence with respect to both work and academic performance. Both Verify and PET
are somewhat more recent batteries for which a modest amount of empirical validity evidence is
available.

Two important points can be made about these differences between batteries. First, this difference in
the amount of available validity evidence does not suggest differences in their true levels of criterion
validity. Although PPM’s publishers have not met professional standards for gathering and reporting
evidence of criterion validity, we are confident in our expectation that PPM has levels of true criterion
validity very similar to GATB or any of the other well-developed batteries we are reporting on. We
should note that this was not an “automatic” conclusion. Rather, this conclusion followed from an
adequate amount of evidence that PPM is a well-developed cognitive battery reliably measuring the
abilities it was intended to measure.

Second, for well-developed cognitive ability batteries, the question of criterion validity has largely
been answered by the large volume of validity evidence that has been aggregated and reported in
meta-analytic studies especially over the past 40 years (See, e.g., Schmidt & Hunter, 1998). These
results can be summarized in any number of ways. The summary that may be most relevant to this
Study 2 purpose of focusing on the relative merits of different subtest types is that for any well-
developed and reliable cognitive ability battery of subtests measuring different abilities, most
combinations of 3 or 4 subtests will yield a true operational criterion validity of approximately .50.
Adding a 5th or 6th subtest to the composite generally will not increase criterion validity.
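The plateau described here falls out of the classical formula for the validity of a unit-weighted composite: R = k * r_xy / sqrt(k + k(k-1) * r_xx), where r_xy is the average subtest validity and r_xx the average subtest intercorrelation. The input values below are illustrative, not meta-analytic estimates.

```python
import math

def composite_validity(k, mean_validity, mean_intercorr):
    """Criterion validity of a unit-weighted composite of k subtests."""
    return (k * mean_validity) / math.sqrt(k + k * (k - 1) * mean_intercorr)

# Illustrative inputs: average subtest validity .35, average intercorrelation .55.
for k in range(1, 7):
    print(k, round(composite_validity(k, 0.35, 0.55), 3))
```

With these inputs the composite validity climbs to about .43 by four subtests and gains little more than a point thereafter, which is the diminishing-returns pattern summarized above.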

There is an important implication of these high level points once the decision is made (a) to develop a
civil service selection test assessing cognitive abilities and (b) that these subtests will be
representative of the types of subtests that produced the meta-analytic results. This implication is
that the remaining decisions about the development of items and subtests such as specific target
constructs, specific types of item content, and specific ways of tailoring the use of the subtests to “fit”
with specific job families will not depend in an important way on considerations of criterion validity.
The one exception to this overall implication is that recent evidence (Postlethwaite, 2012) has shown
subtests assessing crystallized abilities have, on average, higher criterion validity than subtests
assessing fluid abilities. This has important implications for the design of item content.

As a consequence of this implication, the discussions below about important considerations in the
design and development of items and subtests will make little reference to the consideration of
criterion validity.

Test Construct Considerations

The first consideration is the nature of the abilities measured by the reviewed batteries. (Note: for
comparison’s sake, we have included the OPM subtests used in the USA Hire civil service employment
system.) Table 70 shows a method of classifying the subtests in each battery into one of five
categories of work-related cognitive abilities. It is important to note that these five categories do not
correspond to theoretical taxonomies of cognitive ability such as the CHC model of cognitive ability.
While it may be feasible to assign individual subtests to specific CHC abilities, none of the subtests
among these seven batteries was designed specifically to measure any particular broad or narrow
ability within the CHC model. Rather, it might be best to interpret these as five categories of work-
related cognitive ability. While each battery has its own origins and initial conceptual rationale, these
categories have emerged over time as clusters of subtests that have shown evidence of predictive
validity and, broadly speaking, can be associated with job ability requirements in the language
commonly used in job analysis methods designed to describe ability requirements. This distinction
between categories of work-related cognitive ability and theoretical taxonomies of cognitive ability is
not a criticism of the review batteries. It is not likely a source of imperfection or loss of validity.

(Note, in fairness to many of the original test developers for these and other cognitive batteries
designed for selection purposes, the developers were often aware of and informed by theoretical
taxonomies of cognitive ability. But they were likely aware that reliance on the structure of these
taxonomies did not lead to tests with more criterion validity. Of course, in some cases such as
Watson-Glaser and PPM, the original developers were acting on their own theories of cognitive
ability.)

Table 70. Subtests in the target cognitive test batteries.

Verbal / Language (Crystallized)
   DAT PCA: Verbal Reasoning; Language Usage; Spelling
   PET: Verbal Comprehension
   EAS: Verbal Comprehension; Verbal Reasoning; Word Fluency
   GATB: Vocabulary
   PPM: Verbal Reasoning
   Verify: Verbal Reasoning; Reading Comprehension
   OPM USA Hire: Reading

Quantitative (Crystallized)
   DAT PCA: Numerical Ability
   PET: Quantitative Problem Solving
   EAS: Numerical Ability; Numerical Reasoning
   GATB: Computation; Arithmetic Reasoning
   PPM: Numerical Computation; Numerical Reasoning
   Verify: Numerical Reasoning; Calculation
   OPM USA Hire: Math

General / Abstract Reasoning (Fluid)
   DAT PCA: Abstract Reasoning
   PET: Data Interpretation; Reasoning
   EAS: Symbolic Reasoning
   PPM: Applied Power; Abstract Reasoning; Perceptual Reasoning
   Verify: Inductive Reasoning; Deductive Reasoning

Spatial / Mechanical (Mixed)
   DAT PCA: Mechanical Reasoning; Space Relations
   EAS: Space Visualization
   GATB: Three-Dimensional Space
   PPM: Mechanical Understanding; Spatial Ability
   Verify: Mechanical Comprehension; Spatial Ability

Processing Speed / Accuracy
   DAT PCA: Clerical Speed and Accuracy
   EAS: Visual Pursuit; Visual Speed and Accuracy; Manual Speed and Accuracy
   GATB: Name Comparison; Tool/Object Matching; Mark Making; Manual Dexterity (Place, Turn, Assemble, Disassemble)
   PPM: Processing Speed
   Verify: Checking

Other
   Watson-Glaser II: Recognize Assumptions; Evaluate Arguments; Draw Conclusions
   OPM USA Hire: Judgment

An inspection of Table 70 reveals a number of key points about these employment tests. Setting
Watson-Glaser aside for this paragraph, the first point is that all batteries include subtests of Verbal
and Quantitative abilities. Some of these subtests are essentially tests of learned achievement, such
as Numerical Computation, Vocabulary, and Spelling. Some of these tests are a composite of
reasoning with specific learned knowledge, such as Verbal Reasoning and Numerical / Quantitative /
Arithmetic Reasoning. Second, all batteries evaluate reasoning ability either in an abstract form
(most) or in the context of verbal / quantitative content. Third, most batteries include some form of
spatial and/or mechanical subtest. In most cases, these subtests are intended to assess reasoning
in a specific context, either an abstract spatial context or a learning-dependent content of mechanical
devices. To our knowledge none of these subtests was designed to measure knowledge of
mechanical principles / properties, with the possible exception of PPM’s Mechanical Understanding
subtest, which was designed specifically to be a “Performance” test (influenced by learned content)
rather than a “Power” test of reasoning. Fourth, the Processing Speed category represents a variety
of different approaches to the assessment of cognitive and/or psychomotor processing speed. These
are highly speeded tests in that short time limits are allowed to complete as many easy items as
possible where each item requires some form of visual inspection or completion of a manual task. In

general, these speeded tests were designed to be used either for clerical / administrative jobs
requiring rapid processing of less complex information or for manufacturing assembly jobs requiring
high volumes of manual construction or placement of pieces. Our overall observation is that these
types of subtests are being used less frequently as automation replaces simple tasks and as job
complexity, even for entry-level administrative work, increases. For example, Pearson recently
dropped its Clerical Speed and Accuracy test (and its Spelling test) from its DAT for PCA battery.

Finally, the Other category contains two distinctly different types of tests among the reviewed
cognitive batteries. The Watson-Glaser battery is quite different in many respects from the other
batteries. First, it is designed to measure critical thinking skill, which the developers viewed 85 years
ago as a learned skill. The current developers continue to adhere to precisely the same construct
definition. It is very likely that the most common application of Watson-Glaser has been and is in
training / educational programs in which critical thinking skills are being taught. These have included
academic programs for training medical professionals (medical diagnosis being viewed as critical
thinking), lawyers, philosophers, managers/executives and psychologists among many others.
Approximately 2/3 of the predictive validity studies reported above are in educational settings.
Second, consistent with the view that critical thinking is a learned skill, the developers continue to
systematically include controversial topics in all of the subtests, but especially in Evaluate
Arguments. To our knowledge, this is the only non-clinical cognitive battery commonly used for
selection that manipulates the affective valence of item content. The presence of controversial
content is, in effect, intended to represent a difficult "critical thinking" problem and is intended to serve
the purpose that is often served by difficult items. Our observation from the psychometric and validity
information presented above is that it is quite possible the Evaluate Arguments subtest, which
deliberately includes the highest percentage of controversial items, may not be a psychometrically
adequate subtest especially when Watson-Glaser is used for selection purposes. To our knowledge,
no research has been conducted on the incremental value of the controversial topics. Setting aside
our concerns about the value of controversial topics, Watson-Glaser might be considered a test of
verbal reasoning in which the problem content is highly loaded with business/social/academic content.
From that perspective, many of the Watson-Glaser testlets appear to represent the most work-like
content of any of the reviewed batteries, except for the OPM Judgment subtest.

The USA Hire website, https://www.usajobsassess.gov/assess/default/sample/Sample.action, shows
sample items for all OPM subtests. The sample item for the OPM Judgment test is shown here.

Watch the following video. Choose the most AND least effective course of action
from the options below.

Step 1. Scenario
Barbara and Derek are coworkers. Barbara has just been provided with a new
assignment. The assignment requires the use of a specific computer program.
Derek walks over to Barbara's cubicle to speak to her.
If you were in Barbara's position, what would be the most and least effective
course of action to take from the choices below?

Step 2. Courses of Action (each option is rated Most Effective or Least Effective)
 Try to find other coworkers who can explain how to use the new program.
 Tell your supervisor that you don't know how to use the program and ask him to
assign someone who does.
 Use the program reference materials, tutorial program, and the help menu to
learn how to use the new program on your own.
 Explain the situation to your supervisor and ask him what to do.

Second, the OPM Judgment subtest is a situational judgment test for which, we believe, several
versions have been developed where each version is in the context and content of work problems
characteristic of a particular job family. The distinction between the OPM Judgment subtest and the
Watson-Glaser work-oriented critical thinking items may be largely in the definition of the correct
answer. Watson-Glaser, like most other tests designed to measure cognitive ability defines the
correct answer by applying objective rules of inference / deduction such that the correct answer is
specified by rule, not by judgment. Even if imprecision is introduced into Watson-Glaser items by
careful language or deliberately incomplete or vague information, it is our understanding that the
correct answer is found by applying specifiable rules of logic, not by appealing to subject matter
expert judgment. In contrast, it is possible – we are not certain – that the correct answers to OPM’s
(situational) Judgment subtest are specified by the concurrence of subject matter experts. If this is
the case, then the OPM Judgment test is much closer to a job knowledge test than is the Watson-
Glaser test, where the judgment required by the OPM test is judgment about the application of
specific job information and experience to a specific job-like problem. Overall, we do not recommend
that MCS develop situational judgment tests specific to particular job families / roles. The primary
reasons are (a) not enough validity is likely to be gained to be worth the substantial additional cost
required to maintain such tests and (b) such tests place significant weight on closely related previous
experience, which may not be the best use of cognitive testing.
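If the OPM Judgment keys are indeed set by subject matter expert concurrence, then scoring reduces to comparing each response against the SME key. The most/least convention below is a common one among SJT publishers; we assume it here for illustration, as OPM's actual scoring rules are not published, and the key used is purely hypothetical.

```python
# Common most/least SJT scoring convention (assumed, not confirmed for OPM):
# one point for matching the SME-keyed "most effective" option and one point
# for matching the keyed "least effective" option, so each scenario scores 0-2.
def score_scenario(response, key):
    points = 0
    if response["most"] == key["most"]:
        points += 1
    if response["least"] == key["least"]:
        points += 1
    return points

# Hypothetical SME key and candidate response for the sample item above.
key = {"most": "use reference materials and tutorial",
       "least": "ask supervisor to reassign"}
response = {"most": "use reference materials and tutorial",
            "least": "ask coworkers"}
print(score_scenario(response, key))  # 1
```

Under this convention the "judgment" being scored is simply agreement with SME consensus, which is why the text above likens the OPM subtest to a job knowledge test.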

Other than our concerns about Watson-Glaser’s distinctive content and OPM’s use of a situational
judgment test requiring (apparently) subject matter expert definitions of correct answers, we judge all
the subtests in the reviewed batteries as having approximately equally well-founded construct
foundations, even if they are not directly derived from theoretical taxonomies of cognitive ability. The
diverse batteries have equally well-founded construct foundations for three basic reasons. First, their
construct foundations are very similar. All assess verbally oriented abilities; all but Watson-Glaser
assess quantitatively oriented abilities; and all assess reasoning, either abstract or learning-loaded or both. Also, most
assess processing speed and accuracy and spatial / mechanical abilities. However, both of those
construct categories may have little relevance to MCS’s civil service system due to the lack of job
families that are typically associated with these specific abilities. Second, the long history of
predictive validity evidence has shown that virtually any combinations of 3 or 4 different cognitive
abilities tests produce maximum validity levels because the composite measures formed by such
groups effectively assess general mental ability. Third, job analytic methods for analyzing ability
requirements for at least moderately complex jobs will virtually always identify verbal abilities and
reasoning/decision making/problem solving/critical thinking abilities, whether abstract or learning-
loaded.

Our view expressed above that it would not be useful to attempt to reconceptualize the constructs
underlying cognitive tests used for selection has an important exception. Postlethwaite (2012)
conducted an extensive set of meta-analyses of work-related criterion validities for measures of fluid
ability (Gf) and crystallized ability (Gc). These analyses demonstrated that crystallized ability shows
higher levels of criterion validity than fluid ability. The likely explanation is based on two primary
factors. First, job/training performance is most directly affected by job/training knowledge. Cognitive
ability predicts performance because it enables learning of job/training knowledge. Crystallized
ability captures the ability to perform using learned information. Presumably measures of job-related
knowledge are more directly predictive of job/training performance. Second, Gc measures of
acquired knowledge also capture variance due to differences in effort, motivation and disposition to
learn the content. Such variation is also expected to add incremental validity to the relationship
between cognitive ability alone and job/training performance. Because this theory-based distinction
relates to criterion validity differences it is an important construct distinction to consider when
evaluating the constructs underlying the reviewed batteries.

In order to evaluate the extent to which the reviewed batteries include measures of fluid ability and
crystallized ability, Figure 1 displays each subtest included in the reviewed batteries for which a
determination could be made by the authors of this Report of each subtest’s Gf – Gc characteristic.
Subtests designed to measure processing speed are excluded from this Figure. (Note, Figure 1 also
displays information about work-related item content which is discussed below.) In Figure 1, subtests
shown at the top are judged to be measures of crystallized ability; subtests listed at the bottom are
judged to be measures of fluid ability. Within the groupings at the top (crystallized) and bottom (fluid)
the order from top to bottom is arbitrary. We do not intend to convey that subtests measure
crystallized or fluid ability to some intermediate degree. Rather, following Postlethwaite (2012), subtests are
classified as measuring either crystallized or fluid ability.

Two key points can be made about the distribution of Gc and Gf among these subtests. First, both
are commonly represented among the reviewed batteries with three times as many Gc measures as
Gf measures. By design, PPM contains a higher percentage than other batteries of Gf subtests.
Watson-Glaser, PET and OPM batteries contain no Gf subtests. Regarding this point, it is interesting
to note that both PET and Watson-Glaser are designed for and commonly used for higher level jobs
and positions such as manager, executive and professional which might be expected to place more
weight on more general, broadly applicable reasoning and thinking abilities. Certainly, both PET and
Watson-Glaser are designed to assess such thinking skills but they accomplish that purpose with item
content that is high complexity verbal content using moderately work-like content, except for PET
Quantitative Problem Solving, which contains moderately complex arithmetic computation problems.

Second, the PPM distinction between Power tests and Performance tests, which is similar to the Gf -
Gc distinction, does not align exactly with the Gf – Gc distinction. In particular, (a) the Spatial Ability
subtest, which PPM identifies as a Performance subtest is a Gf measure (novel 2-d figures requiring
mental rotation), and (b) the Verbal Reasoning subtest, which PPM identifies as a Power subtest, is a
Gc measure (complex written paragraphs of information from which information must be derived).
Our own view of this misalignment is that it is simply a misclassification by the PPM developer. It is
clear the developer intended Power tests to be very much the same as Gf tests, describing Power as
“…reasoning, where prior knowledge contributes minimally…” and Performance tests to be closely
aligned with Gc, describing Performance as “…relates more strongly (than Power) though not
inevitably, to the presence of experience…”. On this point, it is worth noting that verbal content alone
does not imply a Gc measure. A subtest whose verbal content uses simple words familiar to the test-
taker population can be a Gf measure.

Overall, the reviewed batteries, including the briefly referenced OPM subtests, are representative of
an effective overall construct strategy that places more emphasis on subtests measuring crystallized
ability than on fluid ability subtests. However, it is important to note that most batteries also provide
Gf subtests. While none of the publishers have provided an explicit rationale for this common
decision about target constructs, it is likely that this decision stems from the intention to develop a
battery of a modest number of subtests that are viewed as applicable across a very wide range of job
families. Subtests of fluid ability may be viewed as supporting this objective of having broadly
applicable subtests.

Item Content Considerations

Separate from decisions about target constructs, subtests also vary with respect to item content. For
the purposes of this project evaluating employment tests, the two primary content variables are (a)
work relatedness, and (b) level. Developers also often distinguish between verbal and non-verbal
content but that distinction sometimes means the same thing as the distinction between learned and
novel / abstract content. For example, developers sometimes classify arithmetic subtests as “verbal”.
For our purposes in describing the variation in item content, we will treat the distinction between
verbal and non-verbal as, effectively, the same as the Gc – Gf distinction. Level of content can have
two separate but related meanings: (a) reading level as indexed by some standardized reading index,
and (b) content complexity or difficulty. For example, SHL varies the level of Verify subtest content
depending on the level of the target role/job family by changing the range of item theta values
included in the subtest. Higher level subtests include items within a higher range of theta values.
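As an illustration of this theta-range tailoring, a randomized form can be assembled by sampling only items whose calibrated difficulty falls in a job-level-specific band. The bank, field names, and difficulty bands below are our own hypothetical assumptions, not SHL's actual implementation:

```python
import random

# Hypothetical calibrated item bank: each item carries an IRT difficulty
# (b / theta) value; the IDs and the spread of values are invented.
bank = [{"id": i, "b": -3 + 6 * i / 99} for i in range(100)]

def draw_form(bank, b_range, n_items, seed=0):
    """Randomly sample n_items whose difficulty falls inside b_range."""
    lo, hi = b_range
    eligible = [item for item in bank if lo <= item["b"] <= hi]
    return random.Random(seed).sample(eligible, n_items)

# Higher-level job family -> sample from a higher difficulty range.
managerial_form = draw_form(bank, b_range=(0.5, 2.5), n_items=10)
entry_form = draw_form(bank, b_range=(-2.5, -0.5), n_items=10)
```

Shifting the band up or down is all that distinguishes the higher-level form from the lower-level one; the sampling mechanism itself is unchanged.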

Work Relatedness of Content

To describe the variation in item content, Figure 1’s horizontal axis represents the range of work-
relatedness from work-neutral content on the left to work-like content on the right. (Level will be
treated separately because level can be difficult to “see” in item content.) A high level of work-like
content can vary considerably depending on the nature of the target job family. For example, a test
of mechanical understanding could have a high level of work-like content by relying on the machines,
tools, and equipment used in the target job family. In contrast, a high level of work-like content for a
Reasoning test targeting an upper management role might include complex business problems. It
should be noted, however, that among these reviewed batteries, no subtest has such a high level of
work-like content that items are samples of specific work tasks performed in a target job. Rather,
even the most work-like items among these batteries are abstracted from and described in a more
general form than would be the case for an actual job task. The clear reason for this is that all of
these reviewed batteries, except Verify, are designed to be suited to a range of jobs and positions
without tailoring the content of subtests to different job families. Verify tailors subtest
content to target job level by adjusting the theta range for items that may be sampled from the item
bank. Higher theta values are sampled for higher level jobs. (Also, although the OPM subtests were
not formally reviewed in this project, it appears that OPM developed different versions of each subtest
in which item content was tailored to specific job families. None of the reviewed batteries tailor
subtest content in this manner.)

A number of key points can be made about the representation of item content along the horizontal
axis. First, by definition, no work-like content is present in any subtest measuring fluid ability. We
assume all work-like content would be learned content. Certainly, it’s conceivable that novel content
(Gf) for one applicant population may be learned, work-like content (Gc) for another. For example,
two-dimensional drawings of shapes in a spatial reasoning test could constitute job knowledge for,
say, a drafting job in an architectural firm. But for such a population of test takers, this type of content
in a spatial reasoning test would be a measure of Gc, not Gf. But this would be the rare exception. In
almost all cases, spatial reasoning item content consisting of two-dimensional drawings of shapes
would constitute a measure of fluid ability.

Second, among the subtests measuring crystallized ability there are, roughly, three tiers of item
content as it relates to work content.

Tier 1: Work-Neutral Content. Item content is unrelated to work contexts and has no appearance
of being in a work context. A common example is subtests designed to measure computation skills.
Subtests such as OPM Math, DAT Numerical Ability, EAS Numerical Ability, PPM Numerical Computation, PET
Quantitative Problem Solving, and GATB Computation present arithmetic problems in
addition, subtraction, multiplication, and division, usually with no other context. Even if some (perhaps many) jobs
require computation, these context-free arithmetic problems typically are not designed to sample
arithmetic problems in the workplace. Similarly, many tests based on knowledge of word meaning,
such as vocabulary tests (e.g., GATB Vocabulary, PPM Verbal Comprehension), choose words based
on overall frequency of use rather than frequency of use in work settings.

Subtests with work-neutral content are listed on the left side of Figure 1.

Tier 2: Work-Like Context. Item content is placed in a work-like context but the work context is
irrelevant to the application or definition of the skill/ability required to answer the item correctly. An
example is the Verify Numerical Ability subtest designed to measure deductive reasoning with
quantitative information. SHL places the quantitative information in a work-like context, such as an
Excel file with work-related labels on rows and columns, and perhaps asks the deductive reasoning
question about selected quantities in the language of a business problem. For example, "Which
newspaper was read by a higher percentage of females than males in Year 3?" In such items, the
context is work-like but the work-like context is unrelated to the skill/ability required to answer
correctly. That is, the experience of having learned to perform a job has no distinctive relationship to
the skill/ability required to answer the question.

Subtests with irrelevant work content are shown in the middle “column” under crystallized ability.

Tier 3: Relevant Work Content. Information / knowledge learned to perform work influences the
ability to answer correctly. None of the reviewed batteries includes Tier 3 work content. Only OPM
Judgment and, probably, OPM Reading appear to be influenced by learned work content. Common
examples of such tests are situational judgment tests (SJT) and, in the extreme, job knowledge tests.
Our evaluation of the OPM Judgment test is that it is an SJT designed to be relevant to a family of
related jobs. Different job families would require different SJTs.

The conspicuous disadvantage of SJTs, like job knowledge tests, is that they require periodic
monitoring for currency: job changes may require SJT changes. Also, SJTs usually require expert
consensus to establish the scoring rules and become a surrogate measure of closely related job
experience.

Watson-Glaser has been positioned between Tiers 2 and 3 although not with great confidence. It is
speculation on our part that work experience in higher level roles and in roles requiring complex
decision making, diagnoses, risk evaluations and the like will enable higher levels of performance on
Watson-Glaser tests. This may also be true across all types of content, ranging from business
to school to media contexts. In any case, an important point to make about
Watson-Glaser is that the same item content has been in use for decades (updated from time to time
for currency and relevance, but with precisely the same construct definition) and has been applied
across the full spectrum of management, executive and professional domains of work. There is no
evidence to suggest it has been more (or less) effective within the range of roles or job families in
which it has been applied.

The implication of the Figure 1 representation of construct – content characteristics of cognitive tests appropriate
for employment purposes is that the subtests in the Gc – Tier 2 category are likely to be the most
appropriate, efficient and useful prototypes for subtest development where MCS has an interest in
user acceptance of test content as reflecting work-related contexts. It should be clear that choosing
Tier 2 content over Tier 1 or, for that matter, over Tier 3 content will have little consequence for
criterion validity in job domains that are not highly dependent on specialized job knowledge that can
be learned only on the job or in training provided by the hiring organization.

Also, Tier 2 content can be developed without requiring special task-oriented or worker-oriented job
analyses as would be required to develop Tier 3 content. Tier 2 content does not require precise
descriptions of important job tasks or required worker abilities. Often, readily available information about
major job tasks/responsibilities/duties can provide an appropriate basis for building Tier 2 content.
Also, the involvement of job experts can provide useful ideas about Tier 2 work information.

This evaluation that Tier 2 content may be optimal for certain subtests does not mean we recommend
that all tests be targeted for Tier 2 content. (More detail about this point will be provided in the
Recommendations section.) There would be no loss in validity for choosing Tier 1 content, especially
for subtests intended to be used across all job families, as might be the case with Gc assessments of
reasoning / problem solving.

Level of Content

The distinction among tiers of content described above is a distinction in the manner in which work
content affects item characteristics. This section is about level of content, which represents two
different aspects of item content. The first is the reading level of item content and the second is the
job level. Table 71 describes these two features of content level for the seven reviewed batteries.

Table 71 shows that four of the seven batteries specify a target reading grade level that was part of
the item specifications. DAT PCA and GATB were developed at a 6th grade reading level, Watson-Glaser
at a 9th grade level, and PET at a college graduate level. Notably, the four batteries that are
intended to be used across the full range of work are all designed with virtually no work content
embedded in the items. As Figure 1 shows, only the DAT PCA Mechanical Reasoning subtest
among all DAT PCA, PPM, EAS and GATB subtests contains Tier 2 work content. (OPM is not
included in Table 71 because we have no information about reading level.) Clearly, the test
developers avoided embedded work content that could affect the level of item content for those
batteries intended to have broad applicability across the full range of jobs.

Table 71. Content level information for reviewed batteries.

Battery       | Reading Level (US)           | Job Content Level
--------------|------------------------------|------------------------------------------------------------
PPM           | Not specified                | Virtually no job content, but intended to be appropriate for all job families
Watson-Glaser | Up to 9th grade              | Manager / Professional
Verify        | Not specified                | Manager / Professional
EAS           | Not specified                | No job content, but intended to be appropriate for the full range of work from entry-level wage work to managerial / professional
DAT PCA       | 6th grade                    | No job content, but intended to be appropriate for the full range of work from entry-level wage work to managerial / professional
GATB          | 6th grade                    | No job content; intended to be appropriate for "working adults"
PET           | Up to 16th grade (Bachelors) | Manager / Professional

An important consideration regarding choice of item level is item and test discrimination. In general,
selection tests should be designed to have maximum discrimination value near the critical level of
ability that distinguishes those selected from those not selected. In IRT terminology, this would mean
that the peak of a Test Information Function (TIF) should be in the critical theta range where the
distinction between selected and not selected is located. In CTT terms, this means that the
conditional SEM should be lowest in that same critical range of ability. The optimal overall item and
test development strategy would be to develop items with difficulty levels in the neighborhood of this
critical range. This would mean that tests with high selection rates would be most effective if they
were relatively easy. Tests with low selection rates should be relatively difficult.

When a cut score is used and the test is the only consideration for that step of the hiring process, then
this critical narrow range is known precisely. But when the test is part of a composite and there is no
precise cut off score applied to that composite, as will be the case with the MCS process, this critical
test score range can easily become very wide. That is, the critical range of test scores above which
every applicant is selected and below which no applicant is selected can easily be more than 50% of
the range of all test scores. A practical implication is that it is very likely the TIF curves for MCS
subtests should peak near the middle of the range or be relatively flat across the range of ability.

One method for influencing the precision of test scores in the desired range is to specify desired
levels of item difficulty. Selecting the target level(s) of item content is one mechanism for
manipulating item difficulty in the applicant pool. Choice of reading level is one way of influencing
item difficulty; level of content is another way of influencing item difficulty. This is perhaps the best
argument for tailoring level of subtest content to level of job family. It’s not about criterion validity as
much as it is about selection accuracy. Considerations such as reading level and content level help
produce items measuring ability levels in the range of the selection decision.

Although SHL does not address this issue, the question can be raised whether variation in reading
level would introduce construct-irrelevant variation, thus reducing the validity of the test. Overall, for
the two primary reasons below we believe this would not threaten the predictive validity of selection
tests like those reviewed in the Project and under consideration by MCS.

1. Once reading has been introduced to item content, it influences test scores whether it is varied
a little or a lot. Once reading is required for item performance, variation in reading ability
among applicants ensures that reading will lead to variation in test scores. Increasing the
variation in reading levels is unlikely to increase the extent to which reading adds variance.
Indeed, it is conceivable that increased variation in item reading levels could reduce the role
of reading in item scores.
2. Cognitive tests designed for personnel selection use are usually not intended to measure
theoretically singular ability constructs. The recommendations below do not suggest that
MCS develop theoretically singular subtests. Rather, subtests used for selection often
capture more than one narrow cognitive construct in an effort to maximize predictive validity.
Indeed, the addition of reading to item content is likely to improve subtest validity for jobs in
which reading is a significant requirement. (Note, in the US the primary issue with reading
content in cognitive selection tests is whether the additional variance it contributes also
increases group differences such as white-black differences. In that case, researchers
examine the consequences of eliminating reading content, usually by evaluating whether any
loss in predictive validity is small enough to be outweighed by the gains resulting from
smaller group differences.)

The question of reading's impact on dimensionality is separate from reading's likely positive (or
neutral) impact on predictive validity. This is a complex issue, but the overall result with selection
tests like Watson-Glaser and Verify is that reading content generally does not violate the common IRT
assumption of unidimensionality enough to invalidate the useful application of IRT. Likely, this is for
two primary reasons. First, professionally-developed selection tests generally avoid introducing
reading content where it would be markedly inappropriate, such as in subtests designed specifically to
assess Perceptual Speed and Accuracy or Abstract Reasoning. Second, where reading
content is included, it is likely sufficiently positively correlated with the other targeted abilities that it
does not violate the unidimensionality assumption too much.

SHL takes a unique approach to this issue of content level with its Verify battery. Within each
subtest, Verify item content is tailored to the job level by selecting items from within higher or lower
ranges of theta values. That is, the level of item difficulty shifts up (down) to accommodate higher
(lower) job levels. This probably benefits criterion validity very little but is presumed to benefit
decision accuracy among “close call” applicants.

For MCS's civil service testing system, which will be used mainly by college-level applicants for
service-sector administrative and managerial jobs, the 6th grade (US) reading levels in GATB and DAT PCA
are too low. Reading levels more similar to Watson-Glaser (9th grade (US) or lower) and Verify (16th
grade (US) or lower) will be more appropriate and will better ensure discriminating test scores near the
middle of the ability range associated with the level of ability required to perform successfully in the
target job family/role.
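As a rough illustration of how draft item text could be screened against a target reading level, the following computes a Flesch-Kincaid-style grade estimate. The vowel-group syllable count is a crude heuristic of our own, adequate only for coarse screening, not a substitute for a proper readability tool:

```python
import re

def fk_grade(text):
    """Rough Flesch-Kincaid grade level estimate. Syllables are counted
    as vowel groups, which is a crude approximation."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words
    )
    # Standard Flesch-Kincaid grade-level formula.
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59
```

Such a screen would let item writers confirm that draft passages for a 9th grade target do not drift toward college-graduate complexity before expert review.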

Approaches to Item / Test Security and Methods of Delivery


A major consideration for MCS will be its approach to minimizing and protecting against threats to
item and test security. The key decision will be whether tests shall be administered online in
unproctored environments. We are aware that MCS’s plan is to provide for the administration of the
tests in managed test centers and that the administration of the tests in those centers will be
proctored whether in online or paper-pencil mode. MCS views online, unproctored testing as a very
remote possibility and is not requesting evaluations or recommendations relating to unproctored
testing at this point.

Current Status among Publishers of Reviewed Batteries

All the publishers of the reviewed batteries provide for online delivery of these batteries. In most
cases, there are explicit strategies in place for managing unproctored administration of those online
batteries. For example, SHL and Pearson provide for randomized forms to be delivered online,
proctored verification testing for competitive applicants, and detection methods for comparing the
unproctored and proctored results to evaluate the likelihood of cheating. At the other end of the
spectrum, Hogrefe (PPM) provides for online administration but does not authorize unproctored
delivery. However, they are aware that unproctored delivery occasionally occurs but have no
systematic strategy for detecting possible consequences for item characteristics.

Overall, Pearson and especially SHL have demonstrated a strong commitment to “best practices”
and adherence to professional standards (ITC, 2001) with regard to maximizing security for
unproctored testing. (We make this point while acknowledging that there is not a professional
consensus on unproctored administration of employment tests. Many professionals hold the position
that unproctored testing is not professionally ethical because it knowingly creates an incentive and
opportunity for increased cheating.) Their practices will be of interest to Sabbarah because many of
them will also be useful for managing the long-term security of MCS’s proctored online testing
process.

Methods for Managing Item / Test Security

Given that unproctored testing is a very remote possibility, the implications for best practices from our
review of these seven batteries are ambiguous because the primary focus of the publishers’ attention to
cheating is online, unproctored testing. Nevertheless, we will briefly describe the components of the
item / test security strategy SHL has actively promoted because many will be applicable and useful
even for proctored test administration.

SHL describes a security strategy, fully in effect with Verify, that has two primary components:
(1) tactics within the context of the test administration process and (2) tactics external to the
administration process. Most of these could be applied even for proctored administration, although
the risk of cheating is likely reduced by the proctoring itself.

Tactics Internal to Test Administration

1. Avoid the use of single fixed test forms. This can be accomplished in a number of ways.
a. At a minimum this may take the form of randomizing the order in which a fixed set of
items is presented. This simple step can deter collaborative approaches to cheating
in which several conspiring test takers remember items in a predetermined range, for
example, items 1-5 for conspirator 1, and so on.
b. Randomize the selection of items into tests administered to individual test takers.
This process typically, but not necessarily, relies on (a) IRT calibration of all items in a
large pool and, often, (b) content based random sampling of items from within narrow
clusters of items organized in a multilevel taxonomy of item content. This process is
most efficient when the IRT requirements of unidimensionality are satisfied at the
subtest level at which scores are used to inform selection decisions. Pearson’s
investigation of this requirement with the new, shorter Watson-Glaser forms found
that the assumption of unidimensionality was marginally satisfied in the full pool of all
Watson-Glaser items and produced more reliable forms than a model using different
IRT parameterization for different subscales. SHL has chosen not to share additional
information about their investigation of the same issue for Verify. However, the
publicly available information implies that the IRT parameterization was performed
within each of the first three item banks, Verbal Ability, Numerical Ability and
Inductive Reasoning. This suggests that SHL’s randomized forms approach does
not require unidimensionality across these three subtests but only within each
subtest.
c. Adapt each test to a set of items tailored to the particular test taker through some
form of multistage or adaptive testing model. While this approach may produce more
reliable scores with fewer items, there is no reason to believe it produces more
secure tests than randomized forms.
d. For unproctored administration, develop a system of proctored verification tests
following positive unproctored test results, and develop procedures for following up on
discrepant results that do not assume guilt on the part of the test taker and do not
inadvertently disadvantage the test taker.

2. Communicate clearly to test takers about their responsibility to take the test honestly and
require them to sign a “contract” that they agree to do so.

3. Use technology advances to verify the identity of test takers


a. E.g., thumbprint, retinal pattern, or keystroke-pattern signature
b. Remote monitoring through CCTV/webcam capability
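A minimal sketch of tactics 1(a) and 1(b) above (content-stratified random sampling from a bank, followed by order randomization) might look like the following; the bank, cluster names, and counts are hypothetical, not any publisher's actual implementation:

```python
import random

# Hypothetical item bank organized by content cluster (a simple taxonomy);
# item IDs and cluster names are invented for illustration.
bank = {
    "arithmetic": [f"AR{i}" for i in range(20)],
    "word_problems": [f"WP{i}" for i in range(20)],
    "data_tables": [f"DT{i}" for i in range(20)],
}

def randomized_form(bank, per_cluster, rng):
    """Sample items from each content cluster, then shuffle presentation
    order so no two test takers share fixed item positions."""
    form = []
    for items in bank.values():
        form.extend(rng.sample(items, per_cluster))
    rng.shuffle(form)
    return form

rng = random.Random(42)
form_a = randomized_form(bank, per_cluster=4, rng=rng)
form_b = randomized_form(bank, per_cluster=4, rng=rng)
```

Stratifying by content cluster keeps every generated form balanced across the taxonomy, while the final shuffle defeats position-based collusion schemes of the "you memorize items 1-5" variety.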

Tactics External to Test Administration

1. Actively manage and monitor the item bank.


a. Monitor item parameters for evidence of fraud related drift in parameter values
b. Manage item exposure by exposure rules, use of multiple banks, and/or regular
development of new items
c. Use data forensics to audit test taker responses for unusual patterns associated
with fraudulent attempts to complete the test. Data forensics benefit from the
capability to time keyboard responses to detect abnormally fast and slow
responses.

2. Monitor web sites with “web patrols” searching for indications of collaborative efforts to collect
and disseminate information about the test.

3. Comprehensive training, certification and monitoring of test proctors.
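The parameter-drift audit described in tactic 1 above can be sketched as follows. The item IDs, difficulty values, and the 0.5 flagging threshold are invented for illustration; an operational threshold would be set from recalibration studies:

```python
# Hypothetical bank audit comparing baseline item difficulties (b values)
# against a later recalibration; all values here are invented.
baseline = {"Q1": 0.10, "Q2": -0.40, "Q3": 1.20, "Q4": 0.65}
recalibrated = {"Q1": 0.12, "Q2": -1.10, "Q3": 1.18, "Q4": 0.05}

def flag_drift(baseline, recalibrated, threshold=0.5):
    """Flag items whose difficulty shifted by more than the threshold.
    A large drop in difficulty is one signal of item exposure/compromise."""
    return sorted(
        item for item in baseline
        if abs(recalibrated[item] - baseline[item]) > threshold
    )

compromised = flag_drift(baseline, recalibrated)
```

Flagged items would then be retired or rotated out of exposure, and their response records passed to the data-forensics review described above.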

Even without unproctored administration, the risk of collaborative pirating of test information increases
with the frequency with which tests are administered and the number of administratively diverse
testing centers.

Item and Test Development


The review of seven batteries revealed three broad, strategic approaches to item and test
development – bank development, item accumulation and fixed form production. At the same time,
there appeared to be several characteristics in common across all three strategies.

Bank Development Approach

SHL for Verify and Pearson for the UK Watson-Glaser Unproctored version deliberately developed items in order
to build large banks with diverse items, even within subtest-specific banks. This
approach was dictated by the objective of using randomized forms for each test taker to support
unproctored online administration. In this approach, there is a distinctive requirement to develop a
large number of items and pre-test those items in large samples to provide stable estimates of IRT
parameters. Also, for both Verify and Watson-Glaser there was a requirement to develop diverse
items within subtests (Verify) or subscales (Watson-Glaser). For each subtest, SHL required that a
wide range of item difficulty/complexity/level be developed so that each subtest item bank could
accommodate the SHL approach to tailoring online tests to the level of the job family for which the
subtest was being administered. (SHL creates randomized forms, in part, by shifting item difficulty up
(down) for higher (lower) level job families.) Also, SHL specifies diverse work-like item content for
Verbal Reasoning, Numerical Reasoning, and, possibly, Reading Comprehension to bear some
similarity to work content in the job families for which these subtests may be used. In contrast, SHL
does not require diversity of work-like content for the other subtests that either have abstract item
content or work-neutral content. Pearson also requires diversity of item content within the five
subscales (Recognize Assumptions, Evaluates Arguments, Inference, Deduction and Interpretation,
the last three of which form Draw Conclusions). But it is a different type of diversity. First, there
does not appear to be any instruction or desire to significantly vary item difficulty/complexity/level
within a subscale because Watson-Glaser does not systematically rely on item difficulty to construct
randomized forms. Pearson does not report the instructions / specifications provided to item writers
that define the target level(s) for items.

Watson-Glaser does require that, for each subscale, item content be developed for a variety of
common contexts including business, school, and newspaper/media. This is an important
consideration because Watson-Glaser form construction requires that a minimum percentage of items
within a randomized form be business-related. In addition, the Watson-Glaser test requires a portion
of passage-item sets to be controversial and a portion to be neutral. This proportion is deliberately
different between subscales, with Evaluates Arguments being based on the largest portion of
controversial passage-item sets. Unfortunately, Pearson does not report any item development
methodology to ensure that the controversial / neutral quality of draft items is actually perceived by
test takers.

Both SHL and Pearson rely on empirical pre-testing of new items, making early decisions about
retaining new items based on classical test theory psychometrics relating to item discrimination
indices, item difficulty, and distractor properties, among other considerations. Pearson relies heavily on
experienced expert reviews of draft items to better ensure they satisfy the specific critical thinking
facets intended for each subscale.
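The classical pre-test statistics mentioned here (item difficulty and discrimination) can be sketched with a tiny invented response matrix; operational pre-testing would of course use large samples, and the corrected item-total correlation below is just one common discrimination index:

```python
import statistics

# Hypothetical pre-test data: one row per test taker, 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]

def pearson(x, y):
    """Plain Pearson correlation (implemented here to avoid dependencies)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def item_stats(responses, item):
    """Classical difficulty (proportion correct) and discrimination
    (item score correlated with the rest-of-test score)."""
    scores = [row[item] for row in responses]
    rest = [sum(row) - row[item] for row in responses]
    return statistics.mean(scores), pearson(scores, rest)

difficulty, discrimination = item_stats(responses, 0)
```

Items with very extreme difficulty values or low (or negative) discrimination would be revised or dropped before any IRT calibration.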

After review and psychometric pre-testing, SHL and Pearson estimate IRT parameters in large
samples of test takers. A key consideration is whether the items within a bank satisfy the IRT
requirement of unidimensionality. Pearson’s analysis showed that the unidimensionality assumption
was satisfied sufficiently within the entire bank of items assessing all subscales. SHL, however, has
not provided reports of this level of psychometric detail. Presumably, the unidimensionality
requirement is applied only within subtest banks, not across subtest banks.

For both Pearson and SHL, the initial bank development for their unproctored strategy was seeded
with existing items that had been used in other batteries containing similar subtests. The pace at
which these banks become unique sets of items used only for unproctored administration is not clear.

Item Accumulation Approach

The interview with PSI’s psychologist, Dr. Brad Schneider, disclosed an item development strategy,
especially for EAS if not PET, that was not clear from technical manuals or other documents. We call
this the item accumulation approach. The key feature of the item accumulation approach is that as
PSI develops and tests new items, they are stored in accumulating banks of items, many with long
psychometric histories. These banks of items are then used periodically to replace current fixed
forms of subtests with new fixed forms of subtests without the process of identifying and introducing
each new form. This approach to change in fixed forms is transparent to user/reviewers. It appears
that this type of transparent form change occurs more frequently with PSI cognitive tests including
EAS and PET than with other reviewed batteries such as GATB and PPM.

In spite of the large banks of accumulated items, many or all of which have estimated IRT parameters
used to construct new fixed forms, PSI has not implemented any version of randomized forms or
computer adaptive testing in support of their online administration platform. Rather, at any point in
time EAS is administered with three fixed forms. One is reserved for proctored administration,
another for unproctored online administration in the US, and a third for unproctored administration
outside the US, whether online or paper-pencil. Clients who choose unproctored administration are
made aware that the form they will use is not a secure form and that there is some risk that
scores may be distorted.

This incremental approach to item development and bank accumulation allows PSI to capitalize on an
item writing strategy in which new items are written to be parallel to specific existing items. The
production of item clones enhances the confidence in the parallelism of the resulting new forms. (It is
not known whether PSI uses currently available software to automatically generate new items using
highly specifiable item construction rules. SHL currently uses this approach at least for the items
types that are highly specifiable such as arithmetic computation items.) Also, it appears that even
though item banks accumulate over time, PSI follows an episodic item development approach by
which new items are only developed when the is a recognized need for new forms and the existing
bank may not be sufficient to provide all the needed items.

Fixed Form Production Approach

Publishers for the remaining batteries, PPM, GATB, proctored Watson-Glaser, and, possibly, DAT
PCA, rely on a traditional fixed form approach to item and test development. The histories of PPM,
GATB and proctored Watson-Glaser are similar in that one or two forms have remained in place for
10-20 years, or more, before they are replaced with one or two new forms. DAT PCA clearly has
been revised more frequently, especially more recently, and may be managed currently in a way more
similar to the Item Accumulation Approach. It is unclear.

With the possible exception of DAT PCA, this approach has led to infrequent episodes of item writing.
The extreme case is PPM where, it appears, no new form has been developed since its introduction in
the mid-to-late 1980s. GATB and proctored Watson-Glaser both date back to the 1940s for their
original two forms, and both have had two or three pairs of new forms developed since then. Occasionally,
the introduction of the infrequent new form has been the opportunity for improvement or change.
For Watson-Glaser, each new pair of forms was used to shorten the tests and address psychometric
limitations of earlier forms. For GATB, the development of Forms E and F was the occasion to
eliminate Form Matching, reduce length, and eliminate sources of race/ethnic and cultural bias.

Episodic, fixed form production invites an item development strategy that rests heavily on writing
new items to be parallel to specific existing items. The development of items for GATB Forms E and F
relied heavily on this approach, but the development of new forms for Watson-Glaser did not.
Indeed, a primary emphasis in the development of Forms D and E was to update
the content of the passages and items to be relevant to current populations globally. However, these
targeted improvements did not lead to any change in the target constructs or definition of critical
thinking. Precisely the same definition of critical thinking was specified.

Common Item Development Characteristics

While the contexts and objectives for item development have varied considerably among the seven
reviewed batteries, certain characteristics of the item development were common across most
batteries.

Test Construct Selection

With the notable exception of Watson-Glaser, the publishers made remarkably similar choices about
the target test constructs. As shown in Table 70 above, excluding Watson-Glaser, either five or all six
of the remaining six batteries developed at least one subtest in the following construct categories.
(No category represents a theoretically single or narrow cognitive ability, as if it were sampled from the
CHC framework. Rather, each represents a class of related cognitive abilities with certain features in
common that have an empirical history of predicting work performance.) (Note, we acknowledge that
this commonality may, in part, be due to the criteria by which these batteries were selected. They
may not be representative of the cognitive tests in common use for employment purposes. However,
they were not explicitly selected to be similar.)

 Verbal Ability. All batteries explicitly developed one or more subtests measuring some verbal
ability. Two primary types of verbal ability were selected: (a) verbal knowledge, such as
vocabulary, spelling, comprehension, and reading, and (b) reasoning about verbal
information, which requires some form of logical reasoning applied to information presented in
moderately complex verbal content.

 Quantitative Ability. All batteries explicitly developed one or more subtests measuring some
quantitative ability. As with verbal ability, two primary types of quantitative ability constructs
were selected: (a) knowledge of some facet of arithmetic, and (b) reasoning about
quantitative information.

 Fluid Reasoning Ability. Other than GATB, all batteries included at least one subtest
measuring fluid reasoning ability. In most cases this was some form of reasoning about
relationships among abstract symbols/figures.

 Spatial / Mechanical Ability. Other than PET, all batteries included a subtest measuring some
form of spatial / mechanical ability. None of the reviewed subtests in this category (see
Figure 1) were developed to be job knowledge tests, but some will be more influenced than
others by mechanical knowledge. We acknowledge that there is conceptual overlap between
this category and Fluid Reasoning Ability. We distinguish them for purposes of Study 2
because the common rationale for using Spatial / Mechanical Ability subtests is quite different
from the rationale for using subtests measuring Fluid Reasoning Ability. The distinction lies
in job characteristics. Spatial / Mechanical subtests are ordinarily used only for job families
involving the use of equipment, tools, or mechanical devices where trade skills may be
required. In contrast, subtests of Fluid Reasoning Ability are most often used with higher
level jobs that require decision making with complex information.

 Processing Speed / Accuracy. We include in this category perceptual speed and
psychomotor speed / dexterity. Like Spatial / Mechanical Ability subtests, these are
commonly used only for specific job families including, mainly, clerical/administrative jobs and
manual tool use jobs.

The point of this description is to demonstrate a framework for organizing types of cognitive abilities
that is commonly used with cognitive ability employment tests.

Item Content

For job-oriented employment tests an important decision about items is the extent to which and
manner in which work-like content will be introduced into the items. Figure 1, above, and the
accompanying discussion describe the large differences among cognitive employment tests with
regard to item content. Even though cognitive ability employment tests typically focus on a small set
of common constructs, their item content can be quite different. Certainly commercial interests play
an important role. Subtests with work-neutral content may be more acceptably applied across a wide
range of job families. Subtests with job-specific content may be more appealing for specific job
families. Until Postlethwaite (2012) demonstrated that crystallized ability tests are
more predictive of job performance than fluid ability tests, there was no clearly understood
empirical rationale for preferring more or less job-like content in cognitive ability tests. As it has
happened, the majority of subtests contain learned content, whether work-neutral or
work-relevant, and presumably benefit from the advantage of crystallized ability. In spite of the
recent criterion validity results relating to crystallized ability, we anticipate that commercial interests will
continue to play the most influential role in decisions about the work relevance of item content.

Use of CTT Psychometrics for Empirical Item Screening

Although some batteries do not provide this information, both the Verify and Watson-Glaser batteries,
which rely heavily on IRT psychometrics to manage banks and construct randomized forms, report
relying on classical test theory (CTT) psychometrics for the initial screening of pre-tested items. This
is likely to continue to be the common practice for commercial tests because CTT estimates do not
require samples as large as IRT estimates do.

Culture / Sensitivity Reviews

A common practice continues to be the use of culture experts to review draft items for
culturally/racially/socially sensitive content that could introduce invalid, group-related variance into
item responses.

Measurement Expert Reviews

Some publishers use an additional round of item review by item development experts to better ensure
item quality prior to empirical pre-testing. This approach appears to be most important for novel,
complex items such as those in Watson-Glaser, which relies heavily on expert reviews.

Specification of Item Level/Difficulty/Complexity

No publisher provided information about the manner in which item writers were instructed regarding
the targets for item level/difficulty/complexity. Nevertheless, this will always be an important process
requirement for the development of items and subtests appropriate to the target job families and role
levels. Specified reading level, which some batteries used, may be a part of this consideration. But
other considerations such as level of important job tasks and applicant pool characteristics such as
education level should also inform the decisions about target levels/difficulty/complexity of item
content.

IRT-Based Test Construction

Although few technical manuals provided information about fixed forms test construction, the
interviews with publishers' industrial psychologists revealed that most of the publishers rely on IRT
psychometrics to construct and describe whole subtest characteristics. This is certainly true for the
Verify subtests and the unproctored Watson-Glaser in the UK, but this approach is also used by PSI
for EAS and PET subtests. It is also likely that this approach is used for DAT PCA, but we have not
been able to confirm that. This frequent practice indicates that IRT methods are used in test
construction even where tests/subtests use number-correct scoring or other manual scoring
procedures. This approach has several advantages: it provides a test information function (TIF) for
any combination of items, which allows poor performing or corrupted items to be easily replaced with
other items that maintain the same TIF characteristics; it allows test characteristics to be shifted by
replacing items to improve precision in the range of the ability distribution near the critical score
range that differentiates hired from not hired; and it enables the rapid production of equivalent new
fixed forms as the item bank accumulates more items.
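To illustrate the idea (a sketch only, not any publisher's actual procedure), under a two-parameter logistic (2PL) IRT model the test information function is the sum of item information functions, I(θ) = a²P(θ)(1 − P(θ)). A candidate replacement item can then be screened by checking that the TIF near the critical score range changes very little. All item parameters below are hypothetical.

```python
import math

def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_info(theta, a, b):
    """Item information at ability theta: a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

def tif(theta, items):
    """Test information function: sum of item information over the form."""
    return sum(item_info(theta, a, b) for a, b in items)

# Hypothetical form: (discrimination a, difficulty b) per item
form = [(1.2, -0.5), (0.9, 0.0), (1.4, 0.6), (1.1, 1.0)]
old_item, candidate = (1.4, 0.6), (1.35, 0.55)  # swap out a compromised item

theta_c = 0.5  # ability level near the hire / no-hire critical score
before = tif(theta_c, form)
after = tif(theta_c, [it for it in form if it != old_item] + [candidate])
print(round(before, 3), round(after, 3))
```

Here the replacement is acceptable because the TIF at the critical ability level shifts only slightly; a real construction system would check the whole TIF curve, not a single point.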

Further Considerations Regarding IRT-Based Randomized Forms

Both SHL with Verify and Pearson with Watson-Glaser use IRT-based methods for generating a
“randomized form” for each candidate. While it is possible that two candidates could receive the

same randomized form, it is extremely unlikely. The primary purpose of randomized forms is to
protect item and test security in an unproctored, online test environment. Although the details about
their IRT methodology for generating randomized forms were not available to this Project, we can
safely speculate about certain characteristics of the construction process.

1. The item banking process stores several characteristics of each item, including the keyed
response, estimated IRT parameters, some manner of designating the target ability, other
information about more specific abilities or narrow clusters of highly specific content
groupings (for example, Pearson stores information about the specific subscales each item is
linked to as well as the neutral-controversial characteristic) and, possibly, administrative
information such as time since last use, frequency of use, and pairings with other items that
should be avoided.
2. The process of selecting items into randomized forms is based on complex algorithms to
control for a variety of test characteristics. The Pearson (Watson-Glaser) psychologist
explained that there were approximately a dozen specific rules that each item selection must
satisfy, although he would not share what those rules were.
3. Nothing about randomized forms would prevent administering the items in any particular
order or grouping. Certainly, randomized forms could easily be administered by ordering the
items by difficulty, grouping the items by content similarity, or balancing content
considerations across different subsets of content. Indeed, the last two of these
characteristics of item order and grouping are required by the Watson-Glaser assessment.
4. It is possible that at some late stage of item selection, items are randomly selected from
relatively narrow and small clusters or groupings of highly parallel items in the bank taxonomy.
This could be the case, for example, if item difficulty and discrimination were each divided
into several categories and then used as dimensions of the taxonomy. This would
enable the taxonomy to store the information necessary to identify groupings of parallel items
that are highly similar with respect to narrowly defined content distinctions, discrimination,
difficulty, and possibly other characteristics. It is conceivable that an efficient construction
strategy would be to randomly select an item from within such narrow clusters wherever the
construction algorithm is indifferent to the differences between items in a narrow grouping.
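The speculation above can be made concrete with a minimal sketch of rule-based selection from a tagged bank. The tags, rules, and bank contents below are entirely hypothetical; neither SHL nor Pearson disclosed their algorithms. Items are grouped into narrow clusters of near-parallel items, one item is drawn at random per required cluster, and a simple exclusion-pairing rule is enforced.

```python
import random

# Hypothetical tagged bank: each item carries a cluster ID (narrow grouping of
# parallel items), a difficulty band, and a set of item IDs it must not appear with.
bank = [
    {"id": "V01", "cluster": "vocab-easy", "band": 1, "exclude": set()},
    {"id": "V02", "cluster": "vocab-easy", "band": 1, "exclude": set()},
    {"id": "R10", "cluster": "reason-med", "band": 2, "exclude": {"R11"}},
    {"id": "R11", "cluster": "reason-med", "band": 2, "exclude": {"R10"}},
    {"id": "R20", "cluster": "reason-hard", "band": 3, "exclude": set()},
]

def build_randomized_form(bank, required_clusters, rng):
    """Draw one item at random per required cluster, honoring exclusion rules."""
    form = []
    chosen_ids = set()
    for cluster in required_clusters:
        eligible = [it for it in bank
                    if it["cluster"] == cluster
                    and not (it["exclude"] & chosen_ids)]
        item = rng.choice(eligible)
        form.append(item)
        chosen_ids.add(item["id"])
    return form

rng = random.Random(42)  # seeded only to make the sketch reproducible
form = build_randomized_form(bank, ["vocab-easy", "reason-med", "reason-hard"], rng)
print([it["id"] for it in form])
```

A production algorithm would also handle infeasible rule combinations and enforce many more constraints (the Pearson psychologist mentioned roughly a dozen), but the cluster-then-random structure is the essential idea.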

Neither SHL nor Pearson provided literature resources describing their IRT-based algorithms for
generating randomized forms. As a result, it was not possible to carefully evaluate either algorithm.
However, the equivalency evidence presented by Pearson in the Watson-Glaser (UK) manual
provides persuasive evidence that randomized forms are sufficiently parallel to be treated as
equivalent.

Strategies for Item and Bank Management


Approaches to item and bank management appear to differ considerably across the seven reviewed
batteries. These differences can be captured using the three categories identified above for item
development.

Fixed Form Production Approach

At one end, represented by PPM, GATB, proctored Watson-Glaser and, possibly, DAT PCA, there
is no bank management because the only “active” items are those in the fixed forms. In this case,
items are also regarded as fixed and are not routinely replaced. If evidence accumulates about
problematic items or compromised test forms, new forms are developed and replace existing forms.
Historically, such replacement is infrequent with forms often in use for 10-20 years or longer. For
these publishers, we have no information indicating that they routinely update items’ psychometric
characteristics based on accumulating data.

Item Accumulation Approach

EAS and PET items appear to be managed somewhat more actively with accumulating banks of
items. Some items in these banks are active in the sense that they are in fixed forms that are
currently in use. But other items are inactive and are not in forms that are in use. PSI describes a
process of routine monitoring of items within active forms to detect indicators of over-exposure or
significant fraudulent responding. Indeed, PSI distinguishes between forms available only for
proctored administration, unproctored US administration and unproctored administration outside the
US. This suggests that items are tagged with their current form “status” and that different standards
for retention are applied to items depending on the use of their current form. However, even PSI does
not replace single items. Instead, they will refresh a whole form by replacing all/most items and
regarding the new set of items as a new form, even if this is transparent to the user. Using this
approach, PSI relies on IRT parameter estimates for all accumulated items so that new, transparent
forms may be introduced and equated to previous forms using IRT psychometrics. This approach
also suggests that PSI attends more regularly than publishers of fixed forms with no banks to the
accumulating or recent psychometric properties of active items. PSI provided no detail about the
manner in which they might be doing this.

Bank Development Approach

SHL for Verify and Pearson for unproctored Watson-Glaser in the UK describe a process of active
bank management to support the production of randomized forms to each individual unproctored test
taker. Of the two, SHL appears to take the more active approach, specifically monitoring current
item characteristics using data forensics provided by their partner Caveon, which specializes in
forensic analysis of item responses to detect aberrant patterns. Also, SHL routinely refreshes items
in their banks, based in part on indications of patterns of cheating or item piracy that may have
compromised certain items. Pearson’ description of their approach to bank management does not
indicate the use of data forensics or routine refreshing of items in the bank.

In general, active bank maintenance both for security protection and for forms production requires that
each item be tagged with several characteristics. Pearson describes that each Watson-Glaser
passage-item combination is tagged with 10-15 characteristics. Although Pearson did not identify all
these characteristics, these include IRT parameter estimates (Pearson uses a 3-parameter model
with a fixed guessing parameter), item type (business, school, news media, etc.), subscale
(Recognition of Assumptions, Evaluation of Arguments, Inference, Deduction, Interpretation), and
other item characteristics such as
exposure indices, date of most recent psychometric update, keyed response of correct alternative,
and linkage to other items via passages. All of these 10-15 characteristics are associated with
specific rules for the production of randomized forms. SHL presumably uses a conceptually similar
approach of tagging Verify items with characteristics that are used to create equivalent forms. For
example, SHL’s approach with Verify of linking subtests to job families likely requires that each item
have information tagged to it regarding the job families or universal Competencies to which it is linked.

Overall, the Bank Development Approach requires routine, ongoing monitoring of item characteristics
and, at least in SHL’s support of Verify, routine development of new items to continually refresh the
bank.
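As an illustration of what routine bank refresh might involve (a hypothetical sketch; neither publisher described their actual logic, and all field names and thresholds below are invented), each tagged item can carry an exposure count and a forensics flag, and items breaching either threshold are retired and backfilled from a pretested reserve pool:

```python
# Hypothetical item records; field names and values are illustrative only.
bank = [
    {"id": "A1", "exposures": 120, "flagged": False},
    {"id": "A2", "exposures": 5400, "flagged": False},   # over-exposed
    {"id": "A3", "exposures": 300, "flagged": True},     # data-forensics hit
]
reserve = [{"id": "B1", "exposures": 0, "flagged": False},
           {"id": "B2", "exposures": 0, "flagged": False}]

MAX_EXPOSURES = 5000  # illustrative policy threshold

def refresh(bank, reserve, max_exposures):
    """Retire compromised or over-exposed items; backfill from the reserve pool."""
    retired = [it for it in bank
               if it["flagged"] or it["exposures"] > max_exposures]
    kept = [it for it in bank if it not in retired]
    replacements = [reserve.pop(0) for _ in retired if reserve]
    return kept + replacements, retired

new_bank, retired = refresh(bank, reserve, MAX_EXPOSURES)
print([it["id"] for it in new_bank], [it["id"] for it in retired])
```

In practice, the retired items' IRT parameters would still be retained so that new items can be calibrated and equated against them.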

Potential Item / Bank Management Considerations

All three of the approaches described above will be influenced by several policies about item and
bank characteristics. While none of the publishers provided information about these policies, it may
be helpful to identify certain key ones here.

Data Aggregation. Where the psychometric characteristics of items are periodically updated, an
important consideration is the manner in which previous item data is aggregated over time. With

small volume batteries, it may be advantageous to aggregate item data as far back in time as the item
has remained unchanged. The advantage is in the increased stability of the estimates. However,
with high volume batteries two other interests may outweigh modest gains in stability from exhaustive
aggregation. First, publishers and users may have more interest in current item characteristics than
in historical characteristics. That interest would favor updating item characteristics based only on the
most recent data, with recent data replacing dated data. Second, forensic analyses or larger scale
national/social analyses may have an interest in the pattern of change over time. While this interest
is likely to be focused on changes in applicant populations rather than changes in item properties, it
nevertheless implies that longitudinal data be segmented in some fashion so that item and applicant
characteristics can be evaluated within each segment.
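For example, a simple segmented scheme (our illustration, not any publisher's documented practice) might store an item's proportion-correct per quarter and report a "current" estimate from only the most recent segments, while retaining the full series for longitudinal analysis:

```python
# Hypothetical per-quarter records for one item: (segment label, n, proportion correct)
segments = [("2012Q1", 800, 0.62), ("2012Q2", 900, 0.61),
            ("2012Q3", 1100, 0.58), ("2012Q4", 1000, 0.57)]

def current_p_value(segments, recent=2):
    """Sample-size-weighted proportion correct over the most recent segments."""
    window = segments[-recent:]
    total_n = sum(n for _, n, _ in window)
    return sum(n * p for _, n, p in window) / total_n

print(round(current_p_value(segments), 4))
```

Note that the "current" windowed estimate differs from the all-history aggregate whenever the applicant population or the item's exposure has changed, which is exactly the pattern a longitudinal analysis would want to detect.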

Exposure Rules. An objective in any approach to the production of a very large number of
equivalent forms from a finite bank is that over-sampling of the most useful items should be avoided.
With highly specified algorithms, over-sampling may be managed by specifying item exposure limits.
With proctored fixed forms, publishers generally act as if no amount of exposure is too risky.
However, in a highly visible and “closed” environment like MCS’s civil service testing system may be,
exposure may be a significant concern even with a modest number of fixed forms.

Item Retention Thresholds. Publishers/developers may set different retention thresholds for initial
item development processes than for longer term bank maturation objectives. For example, during
the initial pre-testing phase of item development, lower item discrimination values (e.g., item-total
correlations) may be acceptable in the interest of building a large enough bank and avoiding the
premature deletion of items that may prove valuable once a larger sample is available.
higher retention thresholds may be established in the interest of migrating to a bank of more uniformly
high psychometric quality. Of course, this example of lower thresholds in early item development
decisions than in later bank maturation decisions assumes that it is reasonable to expect sufficient
bank stability that gradual maturation should take place over time. But that may not be the case for
all types of item content or all types of administration contexts. For example, highly technical job
knowledge content might be expected to have a relatively short life span and require frequent
updating with new items. Unproctored administration may require a bank management strategy of
continual item replacement such that no bank “matures” over time. (SHL appears to have adopted
this view to some extent even though they also emphasize practices intended to minimize the
likelihood and incentives for cheating.)
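The item-total discrimination index referred to above is typically computed as the correlation between item scores (0/1) and the total score with the item removed (the "corrected" item-total correlation). The response matrix and the phase-specific thresholds below are hypothetical:

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def corrected_item_total(responses, item_idx):
    """Correlate an item's 0/1 scores with the total score excluding that item."""
    item = [r[item_idx] for r in responses]
    rest = [sum(r) - r[item_idx] for r in responses]
    return pearson(item, rest)

# Hypothetical 0/1 response matrix: rows = test takers, columns = items
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]

PRETEST_MIN, MATURE_MIN = 0.10, 0.25  # illustrative phase-specific thresholds
r = corrected_item_total(responses, 0)
print(round(r, 3), r >= PRETEST_MIN, r >= MATURE_MIN)
```

An item passing the lenient pre-testing threshold could still fail the stricter maturation threshold later, which is precisely the two-stage retention logic described above.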

Use of Data Forensics. SHL, with its partner Caveon, has promoted the strategy with unproctored
administration of routine data forensic investigations of test takers' responses and patterns. Yet
there are strong arguments that proctored modes of delivery are also subject to corruption due to
cheating and piracy, especially with collaborators (Drasgow et al., 2009). Publishers should not
presume that
data forensics have value only when tests are administered in an unproctored setting.

Bank Health Metrics. Closely related to the consideration of item retention thresholds is the
consideration of bank health metrics. (As a practical matter, this consideration is relevant only for the
Bank Development Approach to item and bank management.) Such metrics are intended to be
bank-oriented metrics whether they are simply the aggregate of item-level or forms-level
characteristics or some other bank level indicators such as coverage of all required content domains.
In effect, such metrics would constitute the “dashboard” of bank characteristics publishers would
routinely attend to. Possible examples include (a) the average reliability of whole forms of subtests
produced by a bank following all construction rules; (b) the distribution of item use frequencies; and
(c) average reliabilities of subsets of items associated with particular subcategories of items, such as
content categories or job categories.
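A "dashboard" of that kind could be as simple as aggregating item-level statistics by category. This is a hypothetical sketch; the metric names, item records, and values are all illustrative:

```python
from collections import defaultdict

# Hypothetical item records: content category, discrimination, times used
items = [
    {"cat": "verbal", "disc": 0.45, "uses": 300},
    {"cat": "verbal", "disc": 0.38, "uses": 1200},
    {"cat": "quant", "disc": 0.52, "uses": 150},
    {"cat": "quant", "disc": 0.30, "uses": 2500},
]

def bank_health(items):
    """Average discrimination per content category plus the item-use spread."""
    by_cat = defaultdict(list)
    for it in items:
        by_cat[it["cat"]].append(it["disc"])
    avg_disc = {cat: sum(v) / len(v) for cat, v in by_cat.items()}
    uses = [it["uses"] for it in items]
    return {"avg_disc": avg_disc, "max_use": max(uses), "min_use": min(uses)}

metrics = bank_health(items)
print(metrics)
```

A large gap between `max_use` and `min_use`, for example, would flag over-reliance on a few items and trigger the exposure considerations discussed earlier.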

Bank Size. Many factors influence the optimal size of any bank. These include the number of item
characteristics that must be represented on forms, the number of administrations, item exposure
considerations, subtest lengths, and item loss and replacement rates. The task of estimating the
optimal bank size can be very complex and can depend on uncertain assumptions. Nevertheless,
unless a publisher/developer has an approximate estimate of desired bank size, it will be difficult to
budget and staff for the work of item development and bank maintenance.
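Even so, a back-of-envelope estimate (our illustration, under stated assumptions) can anchor planning. For example, if a subtest has L items per form, the exposure policy caps each item at rate e of administrations, and fraction r of the bank must be replaced each year, then roughly L / e active items are needed plus a yearly replacement pipeline:

```python
def rough_bank_size(form_length, max_exposure_rate, annual_loss_rate):
    """Minimum active items so every form slot can be filled under the exposure
    cap, plus the yearly item-writing requirement to offset losses."""
    active = form_length / max_exposure_rate   # L / e items
    yearly_new = active * annual_loss_rate     # replacement pipeline
    return round(active), round(yearly_new)

# Illustrative assumptions: 20-item subtest, 10% exposure cap, 15% annual loss
active, yearly_new = rough_bank_size(20, 0.10, 0.15)
print(active, yearly_new)
```

Under those assumptions the bank needs about 200 active items and about 30 new pretested items per year; real estimates would also account for content-balancing constraints and pre-test attrition.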

Use of Test Results To Make Hiring Decisions

This section, while important for Sabbarah / MCS’s future consideration, depends only in a minor way
on the Study 2 review of the seven batteries. The reason is that publishers provide little information
about the manner in which their test results might be used by employers. At most, some publishers
identify ranges of test scores that are usually norm-based for employers to use as standards for
interpreting score levels. But even in these cases, the suggested ranges and interpretations cannot
reflect an individual employer's particular set of considerations. Similarly, publishers almost always
recommend relying on additional selection considerations beyond test scores but rarely describe
methods for doing so. Generally, the publishers of the reviewed tests understand and accept that
each employer’s particular method for using their test scores should be tailored to that employer’s
particular needs and interests.

For this reason, this section is not based on publishers’ recommendations or practices. Rather, it is
based on the Study 2 authors’ understanding of Sabbarah and MCS’s objectives and perspectives
about the new civil service testing system and our own experience with this set of considerations with
other employers. (Where an approach is related to characteristics of the reviewed tests, this will be
noted.)

The starting point for this section is that MCS has already made the following important decisions
about the manner in which test results will be used in the new civil service testing system.

 Cut scores will not be used to “qualify” applicants on cognitive test results alone.
 Cognitive test results will be combined with other selection factors in a weighted composite,
which will provide the basis for hiring decisions.
 MCS will not design the testing system to be used for purposes other than hiring decisions.
(That may not prevent users from adopting other purposes for test results, but MCS is not
designing the testing system to support other purposes.)

These decisions limit the scope of this section to a discussion of methods by which subtest results
may be tailored to job families. (Note, related topics having to do with the management of test
results will be addressed in the Recommendations section. These topics will include, for example,
policies about the use of test results, such as the manner in which test results obtained while
applying for one job family may be applied to other job families, and policies about retaking tests.)

That said, we make one suggestion here about MCS policy that will govern the manner in which the
composite score is constructed and used. We suggest that MCS combine test results only with other
factors expected to be predictive of future job performance such as past experience/accomplishments
records, academic degrees, interview evaluations, and so on. A primary reason for this suggestion is
that the weight for each component can then be based on the expected job relevance of the
individual components. In this case, the task of choosing the weights will be guided entirely by
considerations of future job performance. This composite can be meaningfully understood by all
others including employers and applicants. Also, any future interest in changing the composite can
be evaluated in a straightforward manner.
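A weighted composite of this kind is typically formed by standardizing each predictor within the applicant pool and applying job-relevance weights that sum to 1. This is a sketch only; the components, applicant data, and weights below are hypothetical, not MCS policy.

```python
import statistics

def standardize(scores):
    """Convert raw scores to z-scores within the applicant pool."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    return [(s - mean) / sd for s in scores]

def composite(predictors, weights):
    """Weighted sum of standardized predictor scores per applicant."""
    z_cols = [standardize(col) for col in predictors]
    return [sum(w * z[i] for w, z in zip(weights, z_cols))
            for i in range(len(predictors[0]))]

# Hypothetical applicant data: cognitive test, experience rating, interview rating
cognitive = [55, 70, 62, 48]
experience = [3, 4, 2, 5]
interview = [3.5, 4.0, 3.0, 4.5]
weights = [0.5, 0.2, 0.3]  # illustrative job-relevance weights summing to 1

scores = composite([cognitive, experience, interview], weights)
best = max(range(len(scores)), key=lambda i: scores[i])
print(best)
```

Because each component is standardized before weighting, the weights directly express relative job relevance rather than being distorted by the components' different raw-score scales.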

We further suggest that factors unrelated to future job performance such as length of time as an
applicant, age, previous civil service, financial needs, etc. can be evaluated separately based on a
different set of considerations and values. One major benefit of not bundling these two sets of
considerations into a single quantitative composite is that the impact of the non-job related factors can
be more easily recognized. It will be clearer to MCS and employers what the real “cost” is of the non-
job related factors.

Methods for Tailoring the Use of Subtests to Job Families

A pervasive theme in all publishers’ materials is that it is meaningful to tailor the use of cognitive
subtests to different job families. Indeed, some publishers (PSI with PET, Pearson with Watson-
Glaser, and SHL with Verify) have already taken the step of building job-specific content into the
battery. PET and Watson-Glaser were both developed specifically to be appropriate for Managerial /
Professional work and not for other less complex work. Using its IRT foundation and large item
banks, SHL modifies the difficulty of Verify test content to accommodate job level differences. We
provide a framework in Table 72 for evaluating possible ways of tailoring the use of subtests to job
families.

Option 1, “No Tailoring”, refers to the practice of administering the full battery of all subtests to all
applicants, regardless of sought job. Option 2, “Tailored Weights”, as described here, assumes the
same administrative process of administering the whole battery to all applicants. The distinction
between the two is that the “Tailored Weights” produces a total score that is a weighted combination
of subtest scores where the weighting, presumably, reflects the expected job-specific criterion
validities for all subtests. An implication is that a single applicant could receive multiple job-specific
score results, whether needed or not. The potential advantage of maximized validity is likely to be
small, but the incremental cost of estimating job-specific subtest weights may also be small if the
necessary empirical evidence is already available in the research literature. The applicant experience in
these two options would be identical in the testing process but would differ in the presumed feedback
process. All reviewed batteries and OPM could be administered in a whole battery form although the
number and diversity of subtests especially for GATB and PPM could be a noticeable inconvenience
to the test taker because some subtests may appear to have little relevance to the sought job. Only
Watson-Glaser would not be meaningful in the Option 2 mode. Pearson and the original developers
provided no rationale or argument for distinguishing the measurement of critical thinking between job
families.

Option 3 is an operationally significant method of tailoring because it reduces the number of subtests
administered to any applicant to be just those judged to be most relevant to the sought job. Of
course for short batteries like PET and for homogeneous batteries like Watson-Glaser, Option 3 would
not be a meaningful alternative. Neither type of battery was designed to be subdivided in this
manner. On the other hand, batteries with many and diverse subtests would realize significant user
benefits resulting from taking many fewer subtests and avoiding those subtests that would be
conspicuously unlike the job they are seeking. If test centers operate only with paper-pencil fixed
forms, Option 3 can add significant operational costs and administrative time because there are now a
larger number of discreet testing events that can be scheduled. Online administration largely
eliminates this disadvantage, relative to paper-pencil administration. This model of tailoring is
fundamental to the Verify strategy of routinely administering only the job-specific subtests.

Table 72. Methods of tailoring the use of subtests to job families

1. No Tailoring: one complete battery; all subtests administered; all contribute to the total score; the
same total score is used for all jobs.
   Fixed forms: Watson-Glaser, PPM, GATB, PET, DAT PCA, OPM, EAS
   Individualized forms (i.e., MST / randomized): Unproctored Watson-Glaser, Verify
   Benefits: easy to use; minimizes cost to develop; fewer testing events
   Costs: lower face validity; longer test times; higher cost to deliver; unneeded results

2. Tailored Weights: one complete battery; all subtests administered; weighted combinations tailored
to jobs; job-specific composite scores used per job.
   Fixed forms: PPM, EAS, DAT, GATB, PET
   Individualized forms: OPM, Verify
   Benefits: easy to use; minimizes cost to develop; fewer testing events; maximizes validity
   Costs: lower face validity; longer test times; higher cost to deliver; unneeded results

3. Tailored Combinations: multiple groups of 3-4 subtests; grouping of subtests tailored to jobs; only
the group of subtests for the sought job is administered; the total score of administered subtests is
used.
   Fixed forms: PPM, EAS, DAT, GATB
   Individualized forms: Verify
   Benefits: easy to use; shorter testing sessions; minimizes cost to develop; improved test taker
   experience
   Costs: more testing events

4. Tailored Content: item/test content tailored to jobs (items sampled from job-specific banks or by
job-specific rules; content level / difficulty; content type); tailored subtests administered plus other
generic subtests per job; the total score of administered subtests is used.
   Fixed forms: PET
   Individualized forms: OPM(?), Verify
   Benefits: increased user acceptance
   Costs: higher cost to develop

There is a nuanced but possibly significant distinction between Option 2 and 3. Both produce job-
specific score results. But Option 2 (and Option 1) can produce all job-specific score results from the
same testing session whereas Option 3 only produces a single job-specific score result from a single
testing session. The issue that can be raised is whether Option 2 score results should be considered
“official” results for all jobs regardless of the applicant’s intent to apply for only one, or more, jobs.
The problem that can arise occurs when the employer (or MCS) has a retest policy that requires
applicants to wait some period of time, say 3 months, before being allowed to retake an earlier test in
an attempt to achieve a better result. In that case, an applicant may be applying only for one job and
taking the test only to compete for that one job but would be required to wait the retest period before
being allowed to take the same test to apply for a different job. In that scenario, the employer’s retest
policy could inadvertently harm the applicant’s ability to apply for a different job that would require the
same test event.

Option 4 is qualitatively different from the other options. The form of job tailoring is to modify the
subtest content in some fashion. Verify uses this approach by modifying the range of item difficulty in
a randomized form to correspond to the level of the sought job. In the Verify system this is done
primarily for the psychometric benefits of item difficulty levels being more closely aligned with the level
of ability of the test takers for higher (lower) level jobs. Verify does not tailor item difficulty to achieve
greater test taker acceptance. In all likelihood, modest changes in level of item difficulty may not be
noticed by test takers.

Content can be tailored in ways other than difficulty, whether through Tier 2 content, which does not
change the construct being measured, or Tier 3 content, which tailors the construct to assess more
job-specific knowledge/ability. In a way, both PET and Watson-Glaser chose their test content to be
tailored to a specific set of job families in the management / professional domains. While neither is
designed to adapt its content to different job families, both include content designed to be appropriate
for a broad range of higher level jobs. While the primary benefit of Option 4 is increased user
acceptance, it is costly because Option 4 requires a (much?) larger item bank to enable the
adaptation of item content to differences between job families. Also, if forms or content tailoring
involve only Tier 2 changes in content, then there will be no benefit in increased criterion validity. If
the content changes are Tier 3 changes, which change what the test is measuring, then improved
criterion validity is feasible. Neither of the reviewed batteries known to require large item banks
(Watson-Glaser and Verify) is designed to tailor Tier 3 content. It is possible that OPM banks are
designed in that way.

Overall, these alternative approaches to tailoring the use of subtests in the context of the types of
cognitive abilities we reviewed are unlikely to benefit criterion validity in any significant way. Only if
the job content in items were of the Tier 3 type that measures job-specific abilities/knowledge, would
Option 4 tailoring possibly have a significant effect on criterion validity. None of the reviewed
batteries has been designed to depend on Tier 3 job content. Nevertheless, we have located an
apparent example of Tier 3 content in OPM’s USA HIRE Judgment and Reading subtests.

Considerations of Popularity, Professional Standards and Relevance to MCS

The Study 2 specifications require information about the popularity, adherence to professional
standards and the relevance to MCS interests of each of the reviewed batteries. We attempt to
succinctly address each of these in a single table of information, Table 73, that allows the reader to
make an overall evaluation of the usefulness of the information available for each battery. Summary
comments are offered here for each consideration.

Popularity

Publishers were unable/unwilling to provide information about volume of usage for any battery. As a
result, we are only able to estimate the popularity for each battery from other available information.

Given the imprecision of our assessment of popularity, we use only broad categories of popularity to
rank order the batteries. This information should not be taken to reflect actual usage volume.

Overall, there is strong indication that Verify and Watson-Glaser are relatively high volume batteries.
Watson-Glaser is described by Pearson as its most popular selection test. But the likelihood is that it
is more heavily used in educational contexts than selection contexts. At the same time the DAT PCA
is listed as their ninth most popular selection test. The Pearson psychologist described DAT PCA as
in the top five of their cognitive ability selection tests. Although no volume data is available for Verify,
the development information shows very large samples (e.g., 9,000, 16,000) of SHL test takers who
completed trial forms of Verify. The unproctored, online feature coupled with SHL’s aggressive
international marketing approach is virtually certain to have resulted in high volume usage of Verify.
At the other end of the popularity spectrum, the Hogrefe psychologist described PPM as a “low
volume” product used only for certain applications in the UK. Nelson distributes the GATB, but only
in Canada.

Popularity of the remaining three batteries, EAS, PET and DAT PCA, is more ambiguous. Our
professional experience is that EAS is a well-established and commonly used battery that serves as
a “marker” battery in construct validity studies of other tests, for example PPM. PET, also
supported by PSI, serves a narrower market with smaller volume than the broad-based EAS. Also,
the PSI psychologist indicated that PET has fewer forms in current use than EAS, which implies that
PET has lower usage than EAS. DAT PCA has lower usage than Watson-Glaser, and subtests were
recently removed from DAT PCA due to low usage. (Clients may purchase DAT PCA by the subtest.)

Adherence to Standards

Our evaluation of adherence to professional standards is based on all available information including
interviews with publishers’ psychologists. Overall, our evaluation is that publishers range from
moderate adherence to low adherence, especially with NCME, AERA, APA Standards. Generally,
publishers deserve high marks for administrative/instructional information (e.g., Standards 3.19, 3.20,
3.21 and 3.22) but low marks for technical detail about item and test development processes such as
item/test specifications (see e.g., Standards 3.3 and 3.7). Similarly, most publishers have described
the results of construct and criterion validity studies but have provided little information about the
technical details for those studies (see Standard 1.13 for example).

Adherence is more difficult to judge with respect to SIOP’s Principles partly because they are
described in less prescriptive language and many of the principles address details relating to specific
organizational decisions/procedures such as cut scores and job analyses, which publishers may not
be in a position to address. Overall, however, publishers deserve low marks for their efforts to
protect the security of the tests, with the exception, of course, of Verify and Watson-Glaser
unproctored in the UK. We have observed that publishers overestimate the protection provided by
proctoring.

We should note that non-compliance with professional standards does not necessarily imply that a
battery has no information value for MCS. For example, we judge PPM to be in poor compliance with
professional standards. Nevertheless, PPM offers a useful example of a full set of cognitive ability
subtests constructed with work-neutral item content. In many instances, the value of a battery for
MCS does not depend on the publisher’s adherence to professional standards, especially those
standards relating to documentation.

Relevance to MCS Interests

All batteries were recommended and selected because, for each one, certain features were initially
seen as relevant to MCS’s interests. So, all reviewed batteries were expected to have some
information value for MCS. As information was gathered, we recognized that Verify became a more
valuable source of information while PPM and Watson-Glaser became somewhat less valuable, in our
judgment. Also, the contributions of EAS, DAT PCA, GATB and PPM were somewhat more
interchangeable than were the somewhat distinctive contributions of Verify, PET and Watson-Glaser.
The relevance of each battery’s information to MCS will be more evident in the Section 7
recommendations.

Table 73. Information about batteries’ popularity, adherence to standards, and relevance to MCS.

Verify
Judged Popularity: High
Adherence to Professional Standards: Moderate
 Technical test production detail missing
 Item development detail missing
 Validation results provided
 Validation procedures missing
 Security protection detail provided
 Rationale for randomized forms provided
 Test specifications not well described
 Reliability detail provided
 Psychometric model described
Relevance to MCS: High
 Best example of security protection for unproctored administration
 Best example of bank-oriented construction of multiple forms
 Some of the best examples of work-like content
 All constructs are applicable
 Good example of strategy for controlling subtest level

Watson-Glaser
Judged Popularity: High
Adherence to Professional Standards: Moderate
 Technical test development detail missing
 Item development detail provided
 Construct detail lacking
 Reliability detail provided
 Validity results provided
 Validation procedures missing
 Administration/scoring instructions provided
 Time limit information is ambiguous
Relevance to MCS: Moderate
 Good example of bank-oriented construction of multiple forms with testlets
 Content unlikely to be directly applicable
 Construct, as is, may not be applicable
 Strong publisher support of users

EAS
Judged Popularity: Moderate - High
Adherence to Professional Standards: Moderate
 Extensive validation results presented
 Validation procedures missing
 Inadequate security protection in unproctored setting
 Reliability detail presented
 Item development detail missing
 Adequate administrative support
Relevance to MCS: Moderate - High
 Good example of a diverse battery
 Good examples of work-neutral content
 Most constructs are applicable
 Extensive validation

PET
Judged Popularity: Moderate
Adherence to Professional Standards: Moderate
 Validation results presented
 Validation procedures missing
 Inadequate security protection in unproctored settings
 Reliability detail provided
 Modest item development detail reported
 Administrative support provided
Relevance to MCS: High
 Very good example of higher level work-like content
 All constructs are applicable

DAT PCA
Judged Popularity: Moderate
Adherence to Professional Standards: Moderate
 Most item development detail missing
 Most test construct detail missing
 Reliability detail provided
 Validity results provided
 Validity procedures missing
 Marginal security protection (1 fixed form)
 Administration/scoring information provided
Relevance to MCS: Moderate - High
 Good example of a diverse battery
 Good examples of work-neutral content
 All constructs are applicable
 Strong publisher support of users

GATB
Judged Popularity: Moderate - Low
Adherence to Professional Standards: Moderate - High
 Extensive item/test development detail provided
 Extensive reliability and validity detail provided
 Extensive item review processes
 Marginal security protection (fixed forms)
Relevance to MCS: Moderate - High
 Good examples of work-neutral content
 Most constructs are applicable
 Extensive validation

PPM
Judged Popularity: Low
Adherence to Professional Standards: Low
 Criterion validity evidence nonexistent
 Item development detail missing
 Reliability detail presented
 Construct validity detail presented
 Administrative information provided
 Inadequate security protection provided for aging forms
Relevance to MCS: Moderate
 Good example of work-neutral content
 Most constructs are applicable
 Poor example of validation practice
 Poor example of publisher support
SECTION 7: RECOMMENDATIONS AND SUGGESTIONS

A requirement of Study 2 is that recommendations be made with respect to (a) MCS’s GCAT plan and
specifications, and (b) important organizational and operational considerations relating to
management, staffing, security, item banking, technical support, administration and the design of test
guides, manuals and reports. This section provides these suggestions and recommendations
organized in the following manner.

1. GCAT Plan and Specifications


A. Choosing Test Constructs
B. Specifying Item Content
C. Methods of Item and Test Development
D. Validation Strategies

2. Organizational and Operational Considerations


A. Security Strategy
B. Item Banking
C. Operational Staff Requirements
D. Design of User Materials
E. Strategies and Policies

These recommendations are based on multiple sources including the reviews of the seven batteries,
our own professional experience with personnel selection testing programs and the professional
literature relating to the design and management of personnel selection systems. We do not limit
these recommendations to just the methods or approaches used by one or more of the reviewed
batteries.

GCAT PLAN AND SPECIFICATIONS

Choosing Test Constructs

Background

Perhaps no decision is more important than the choice of abilities the tests should measure. The
recommendations described here about what abilities should be selected are based on many
considerations. But there are four assumptions about MCS’s civil service testing strategy that are
important to these recommendations.

A. MCS desires an efficient testing system that uses no more tests than are necessary to
provide maximum predictive validity for all civil service jobs.
B. MCS does not intend to use these same tests for career guidance purposes.
C. MCS does not require that the tests measure job knowledges or skills specific to
particular jobs. (We acknowledge that MCS may have an interest in a Phase 2 addition
to the recommended tests that includes job-specific tests.)
D. MCS’s civil service jobs are predominantly in the service sector of work ranging in level
from clerical / administrative jobs to management / professional jobs. The applicant
pools for these civil service jobs will have some level of college education. Some will be
new graduates, others will be experienced workers.

In addition to these assumptions, the recommendations are based on certain perspectives about tests
used for personnel selection purposes. The first perspective is that the selection purpose or use of

the tests requires that test scores predict performance in the target jobs. This is the most important
requirement of selection test scores. Second, there is no requirement from MCS or professional
practice/standards that these tests be theoretically singular. It is acceptable, perhaps even desired,
that individual tests measure more than one theoretically distinct ability. For example, Reading
Comprehension tests are likely to measure Verbal Comprehension as well as acquired knowledge
relating to the substantive content of the reading passage. Generally, such complexity is not
disadvantageous and may be beneficial for personnel selection tests. They are not designed to test
theories of ability. Rather, the language and meaning of theories about cognitive ability should inform
and guide the decisions about what abilities should be measured and the manner in which the tests
should measure them. For the purposes of these recommendations about ability test constructs we
will refer to the Kit of Factor-Referenced Tests described in Ekstrom, French, Harman, and Dermen (1976)
associated with the 23 cognitive aptitude factors identified in the work by the Educational Testing
Service (ETS). For our purposes, the advantage of this framework is that it identifies a small number
of marker tests for each of the 23 identified aptitude factors. These marker tests enable a clearer
discussion about the types of tests that we recommend to MCS.

The third perspective is that there is a considerable research foundation about the predictive validity
of cognitive ability tests with respect to job performance. Many decisions that MCS will make should
be informed by that research base beginning with this decision about what the tests should measure.

Recommendations

Recommendation 1: Develop subtests in four categories of cognitive ability: (a) Verbal Ability, (b)
Quantitative Ability, (c) Reasoning Ability, and (d) Processing Speed/Accuracy. These categories are
widely used, as observed in six of the seven reviewed batteries, have a research foundation of
predictive validity, and are routinely identified by structured job analysis methods as required for
service jobs.

Recommendation 2: Gather available job information across the service sector to make an initial
evaluation of whether Psychomotor and/or Spatial/Mechanical categories of ability may also be
required by jobs within the service sector.

Recommendation 3: Develop two or more subtests within each selected ability category to measure
specific abilities within each category. Suggestions for specific abilities are described in the rationale
below for this recommendation.

Rationale for Recommendations

Recommendation 1

This recommendation is largely based on the accumulated evidence that cognitive ability tests relating
to Verbal Ability, Quantitative Ability and Reasoning Ability predict job performance across a wide
range of jobs. Further evidence shows that Processing Speed tests predict performance in jobs
requiring the rapid processing of alphanumeric information. For example, in a large-scale
meta-analysis among clerical jobs, Pearlman, Schmidt and Hunter (1980) reported average corrected
validities with respect to proficiency criteria for Verbal Ability, Quantitative Ability, Reasoning Ability
and Perceptual Speed of .39, .47, .39 and .47, respectively. With respect to training criteria, these
same ability categories averaged corrected correlations of .71, .64, .70 and .39, respectively. This
research foundation is the likely explanation for the fact that commercially available cognitive ability
batteries used for selection often contain subtests within each of these categories. For example, of
the six reviewed batteries other than Watson-Glaser, five include subtests in all four categories, and
all six included subtests in Verbal Ability and Quantitative Ability.

This strategy of including subtests in four diverse ability categories better ensures that some
combination of 2-4 subtests from across the four categories will produce a maximum level of
predictive validity, because the combination of 2-4 different ability tests will constitute a measure of
general mental ability.
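The logic of combining diverse subtests can be made concrete with the standard formula for the validity of a unit-weighted composite of standardized predictors. The subtest validities below are the Pearlman, Schmidt and Hunter (1980) proficiency values quoted above; the uniform .50 subtest intercorrelation is a hypothetical assumption for illustration only.

```python
import numpy as np

def composite_validity(validities, intercorr):
    """Validity of a unit-weighted composite of standardized subtests:
    the sum of subtest validities divided by the standard deviation of
    the composite (square root of the sum of all elements of the
    predictor intercorrelation matrix)."""
    k = len(validities)
    R = np.full((k, k), intercorr)   # assumed uniform intercorrelation
    np.fill_diagonal(R, 1.0)
    return float(sum(validities) / np.sqrt(R.sum()))

# Verbal, Quantitative, Reasoning, Perceptual Speed proficiency validities.
r = [0.39, 0.47, 0.39, 0.47]
print(round(composite_validity(r, 0.50), 2))  # 0.54
```

Under these assumptions the composite validity (about .54) exceeds the validity of any single subtest (.47 at best), which is the rationale for drawing 2-4 subtests from across the categories.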

Recommendation 2

Recommendation 1 to develop subtests in four categories of abilities is intended, in part, to ensure
that performance will be maximally predicted for all jobs within the service sector of civil service jobs.
Nevertheless, there may be specific features of some service sector jobs that require one or both of
two more specialized ability categories – psychomotor skills/abilities and spatial/mechanical ability.
Thorough job analyses are unlikely to be needed to determine whether spatial/mechanical subtests
are warranted. Often, available job descriptive information showing significant
use/repair/maintenance of tools, equipment and machinery is sufficient to recognize the potential
value of spatial/mechanical ability tests. For this category, our recommendation is that subtests not
be developed unless job information indicates potential value for such subtests.

The question of whether psychomotor tests have potential value should be addressed somewhat
differently. (We presume that the Study 2 requirement to review only cognitive ability tests is an
indication that MCS has already determined that psychomotor testing is not warranted.
Nevertheless, we note here that, if that decision has not already been made, the decisions about
whether to develop psychomotor tests and what psychomotor skills/abilities should be assessed
would require a more specialized and thorough analysis of the physical ability requirements of the
jobs in questions.

Recommendation 3

The recommendation to develop two or more subtests within each selected ability category is based
on two primary considerations. First, each ability category represents a moderately broad range of
related specific abilities. Including two or more distinctly different subtests in each category improves
the representation of that category and increases the likelihood that the broader ability is adequately
represented within the battery. For example, the Verbal Ability category includes at least four
specific verbal ability test types described by Ekstrom et al. (1976) – Verbal Comprehension, Verbal
Closure, Verbal Fluency and Associational Fluency. In addition, the commonly used Reading
Comprehension test type involving paragraph comprehension fits within this category. Also, many
publishers treat Verbal Reasoning tests as part of a group of Verbal Ability tests. This diverse set of
possible specific verbal abilities will be better represented by two or more subtests than by a single
subtest.

Second, distinctively different specific ability subtests within a category increase the opportunity for
one of the subtests to reliably predict performance in any target job family at a higher level than the
other subtest(s) within the category. This rationale is the primary reason publishers commonly adopt
this recommended approach. Perhaps the best example of this approach among the reviewed
batteries is EAS, for which PSI has an extensive record of job-specific validity research. PSI
recommends an empirically optimal combination of subtests for each of several job families. For
example, PSI recommends the combination of Verbal Comprehension, Numerical Ability, Numerical
Reasoning, and Verbal Reasoning for Professional, Managerial and Supervisory jobs. This particular
combination produced a meta-analytic average validity of .57, which was higher than for any other
combination.

Table 74 shows suggested specific ability subtests for each of the recommended categories of ability.
These suggestions are taken from the Study 2 review of seven cognitive batteries as well as the
Ekstrom, et al. (1976) kit of factor-referenced tests. These are not exhaustive suggestions but are
intended to provide examples of the typical diversity of subtests within the broader categories. (Note:

Examples of specific psychomotor subtests are not shown here because the choice of specificity
depends to a great extent on the particular psychomotor requirements of the job.) The examples
described in Table 74 are considered to be plausible subtest types for MCS’s civil service system.
Many of the marker tests shown in Ekstrom, et al. (1976) are instructive but not plausible as item/test
types for MCS’s civil service system.

Table 74. Examples of plausible specific ability subtest types within each broad recommended ability
category.

Verbal Ability
 Verbal Comprehension (Ekstrom, et al.): Vocabulary meaning
 Reading Comprehension (PET): Paragraph meaning
 Verbal Reasoning (Verify): Reasoning to conclusions from a paragraph of information

Quantitative Ability
 Number Facility (Ekstrom, et al.): Speed test of simple to moderate arithmetic computation
problems (no calculator)
 Numerical Reasoning (Verify): Deriving conclusions from numerical information in a moderately
complex problem context
 Calculation (Verify): Deriving the missing number in an arithmetic equation (with calculator)
 Quantitative Problem Solving (PET): Computing answers to arithmetic problems taken from
business/government contexts

Reasoning Ability
 Following Directions (Ekstrom, et al.): Deriving answers to complex questions about tabled
information
 Diagramming Relationships (Ekstrom, et al.): Comparing Venn diagrams to verbally described
categories
 Inductive Reasoning (Verify): Reasoning about the relationships among abstract shapes/objects
 Reasoning (PET): Syllogistic reasoning about written information

Spatial / Mechanical Ability
 Card Rotations Test (Ekstrom, et al.): Mental rotations of drawn shapes to recognize similarity
(difference)
 Mechanical Comprehension (Verify): Recognize effects on complex objects of described actions
 Mechanical Understanding (PPM): Answer questions about drawn objects, tools, equipment,
machines (close to mechanical knowledge)
 Space Relations (DAT PCA): Visualize and rotate a folded 2-D drawing in its 3-D form

Processing Speed / Accuracy
 Number Comparison Test (Ekstrom, et al.): Comparing pairs of long numbers for sameness or
difference (speeded)
 Checking (Verify): Finding the same alphanumeric string in a set of several strings (speeded)
 Processing Speed (PPM): Alphabetize three written words (speeded)
 Name Comparison (GATB): Determine whether two written names are the same or different

Specifying Item Content


Background

Once ability constructs are selected for development, decisions must be made about item content.
The seven reviewed batteries are especially instructive for this purpose because they demonstrate
the primary options available to MCS regarding item content. For purposes of these
recommendations and MCS’s item-test development work, we describe three key facets of item
content:

A. Work relevance
B. Level / Difficulty / Complexity
C. Fluid Ability v. Crystallized Ability

Three levels of work relevance are described: Tier 1 – Work Neutral; Tier 2 – Work-Like Context;
and, Tier 3 – Work Relevant. Tier 1 content relies on acquired knowledge, usually
reading/language/arithmetic skills but the content is not derived from job content and is not intended
to be similar to job content. Tier 2 content relies on acquired knowledge, usually
reading/language/arithmetic skills, but the content is deliberately given a work-like context such as a
table of information similar to what might be used in a work context. However, the work-like context is
not so specific to any particular job that job experience or acquired job knowledge is required to
answer correctly. The intent of Tier 2 content is to convey to the test taker the manner in which the
test content may be related to job performance. Tier 2 content is not used to measure acquired job
knowledge.

Tier 3 content, on the other hand, is introduced for the purpose of measuring some specific acquired
job knowledge. Situational judgment tests often exemplify Tier 3 content. Most reviewed subtests
are based on Tier 1 content, a smaller number are based on Tier 2 content, and no reviewed subtest
was based on Tier 3 content, although Watson-Glaser falls between Tier 2 and Tier 3. We believe
two OPM tests, which were noted in Study 2 but not reviewed, represent Tier 3 content.

The level of item content is an important consideration for two reasons, especially for selection tests.
First, level of content usually influences item difficulty, which influences the precision with which the
test measures ability levels. Overall, it is important that selection tests have adequate measurement
precision in the critical ability range where hired and not hired applicants are discriminated. That
critical range is often a function of the job’s complexity level or educational level. Items written at a
level in the vicinity of this critical ability range are more likely to be psychometrically effective.
Second, item content that is seen by test takers or by hiring organizations as dissimilar to the level of
job content, can be a source of significant dissatisfaction with the test. It is important the item and
test development process develop a level of content appropriate to both job level/complexity as well
as applicant level.

Postlethwaite (2012) recently demonstrated that selection tests measuring crystallized ability (Gc)
tend to have higher predictive validity than tests measuring fluid ability (Gf). This is new information
and should have a significant influence on MCS’s decisions about item content. Overall, MCS should
prefer Gc item content over Gf content except where there may be a compelling rationale that offsets
the overall validity difference. Even though this information is new and was not a well-established
view prior to 2010, publishers have shown a strong preference for Gc subtests. Approximately 75%
of the non-speeded cognitive subtests in the seven reviewed batteries were Gc measures.

In addition to these three facets of item content, decisions about item content are also influenced by
cost and efficiency considerations. Overall, our professional experience has been that the job
specificity of item content is related to cost of item and test development. Tier 3 content is likely to be
significantly more expensive to develop because it requires more structured, comprehensive
information about the content of job tasks, activities, knowledge, skills, abilities and/or other
characteristics (KSAOs). Also, Tier 3 content is usually motivated by an interest in highly job-specific
assessments of job KSAOs. This strategy is very likely to lead to a significant increase in the number
of required subtests. It is no accident, in our view, that none of the publishers of the reviewed
batteries has adopted a Tier 3 content strategy. They have a strong marketing interest in single
batteries that are attractive to the widest possible range of users. This marketing interest happens to
be closely aligned with the science of cognitive ability selection testing which has demonstrated that

well-developed cognitive ability tests are predictive of job performance across a wide range of jobs
without requiring job-specific content.

Recommendations:

Recommendation 4: Develop items with Tier 2, crystallized ability content in most, if not all,
subtests. The MCS range of job families is restricted to service sector jobs. This is likely to mean
that Tier 2 content could be developed to be generally similar to the types of information processing,
problem solving and learning contexts common to jobs across this sector of work. There is unlikely to
be a validity, cost or user acceptance argument favoring fluid ability tests with abstract content or
crystallized ability tests with work-neutral (Tier 1) or work-relevant (Tier 3) content.

1. Develop reasoning subtests that are framed as tests of “problem solving” or “decision making”
ability in work-like problem contexts rather than using abstract content.
2. Note that MCS’s plan to develop job-specific assessments in Phase 2, such as performance
assessments, may require Tier 3 content and the additional job analytic work that such content
is likely to require.

Recommendation 5: Specify a modest range of content levels for all subtests associated with the
range of college education levels of most applicants.

1. Develop items that are likely to span the range of difficulty for this applicant population, from
relatively easy (say, a proportion-correct difficulty near .90) to relatively difficult (near .30). Do
not establish a narrow objective of developing items with fairly homogeneous difficulty levels
near a middle value such as .60.
2. Within each subtest, develop items with Tier 2 content that samples job-like activities and
tasks across different service sector job families (administrative, managerial, technical
support, etc.)
3. Do not develop different subtests with lower content levels for lower level administrative /
clerical jobs than for higher level professional / managerial jobs. Rather, develop moderately
diverse items across a range of job content representative of the range in the service sector.
Each subtest should have sufficiently diverse item content levels to be seen as relevant to the
full range of service sector work.
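The difficulty-span objective in point 1 above can be checked with a simple profile of a bank's proportion-correct values across the recommended .30-.90 range. This is an illustrative sketch only: the simulated bank, the band edges, and the 0.10 flatness tolerance are hypothetical assumptions, not MCS specifications.

```python
import numpy as np

def difficulty_profile(p_values, edges=(0.30, 0.45, 0.60, 0.75, 0.90)):
    """Share of items falling in each difficulty band of the .30-.90 range."""
    counts, _ = np.histogram(p_values, bins=edges)
    return counts / counts.sum()

# Simulated bank of 400 items with roughly uniform p-values (an assumption).
rng = np.random.default_rng(0)
bank = rng.uniform(0.30, 0.90, size=400)

profile = difficulty_profile(bank)
print(np.round(profile, 2))   # each band's share, near 0.25 for a flat bank
print(bool(np.all(np.abs(profile - 0.25) < 0.10)))   # flat enough?
```

A bank clustered near a middle value such as .60 would fail this kind of check, with most of its mass in the central bands.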

Rationales for Item Content Recommendations

Recommendation 4

Recommendation 4 is based primarily on three separate considerations:

1. Tests of crystallized ability have been shown to be more predictive of job proficiency and
training success than tests of fluid ability. The advantage of fluid ability tests, namely that they
appear equally relevant on their face to all jobs, does not overcome the validity disadvantage.
2. Tier 2 content is likely to increase user acceptance compared to Tier 1 content or abstract
content. Among cognitive ability tests, Tier 3 content is unlikely to improve predictive validity.
3. Unlike the commercially available batteries reviewed for this Study 2, MCS has no business
requirement to develop a single battery that is seen as relevant across a very wide range of
work. The popularity of Tier 1 content among commercially available batteries is very likely
due to their market considerations, which are quite different from MCS’s “market”
considerations. MCS’s focus on service sector jobs allows it to capitalize on the benefits of
work-like content without causing any restriction in their usage for Saudi Arabia civil service
testing purposes.

Recommendation 5

Recommendation 5 essentially proposes that item content for all subtests vary in level/complexity
around an overall level associated with the college level of education of most applicants. This level
may involve reading level but should also involve the complexity of item content so that it is similar to
the range of complexity in service sector work performed by college educated employees. This
approach should result in item content similar in level to the Verify Numerical Reasoning and Verbal
Reasoning subtests as well as the PET Reading Comprehension and Reasoning subtests. Also, the
OPM Reasoning subtest exemplifies this approach to item content.

This objective of developing items of modest diversity centered on a level appropriate to college
educated applicants is intended, in part, to support a possible longer term bank management strategy
of estimating IRT-based item parameters that may be used to develop test forms tailored to
differences in job level similar to the manner in which SHL adjusts the theta level of Verify items for
the level of the target job family. A recommendation is made below to develop large item banks over
time to support the automatic production of large numbers of forms. This longer term strategy would
be facilitated by an item bank that has a relatively flat distribution of item difficulties.

This recommendation to develop diverse items within a subtest may raise the question of whether
such diversity of level and content would threaten the unidimensionality of the items within a subtest.
Potentially, this could be problematic for the eventual IRT-based approach recommended for bank
management and forms production. We believe it is true that, all else the same, increasing item
diversity within a subtest will tend to slightly reduce the unidimensionality of the item pool. This
modest effect would not be due to variability in difficulty itself. After all, IRT psychometrics explicitly
represent variation in item difficulty and examinee ability levels. The likely source of small increases
in item heterogeneity would be the diverse item content associated with differences in job-like
tasks used to generate item topics or other differences in test tasks unrelated to job content. For
example, the items in a Computation subtest frequently include items capturing all four major
arithmetic operations and vary the number of significant digits in number values. Strictly speaking,
both of these common sources of item diversity increase item heterogeneity, thus reducing item
unidimensionality. But this degree of increase in item heterogeneity has never been problematic for
IRT. In other words, the IRT unidimensionality assumption does not require theoretically singular
item pools.

In the special case of subtests designed to measure the types of moderately broad cognitive abilities
recommended here, which are typical of commercially available selection tests, it appears likely that
the positive manifold of cognitive subtests satisfies the unidimensionality requirement sufficiently for
IRT parameter estimates to be stable, consistent, invariant, and useful. Both SHL (Verify) and
Pearson (Watson-Glaser) found this to be true. (As did PSI in their IRT analyses of accumulating
banks of used items for EAS and PET.) While we believe SHL estimates Verify IRT parameters
based on within-subtest parameterization, Pearson concluded that the full set of Watson-Glaser items
spanning all three major subtest domains was sufficiently unidimensional to produce useful, stable
and invariant scores.

We believe that our recommendation to develop diverse items within the specific domain of each
subtest is very unlikely to risk failure of the unidimensionality assumption. This is not because job
complexity is, itself, unidimensional. Rather, it is because diverse measures of narrow, specific
abilities within a broader category of cognitive ability tend to correlate positively with one another
sufficiently to satisfy IRT’s unidimensionality assumption.
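
For readers less familiar with the IRT machinery referenced here, a minimal sketch of the two-parameter logistic (2PL) item response function illustrates how a single ability dimension (theta) accommodates items of very different difficulty. All parameter values are hypothetical and are not drawn from any of the batteries reviewed:

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: probability that an examinee of
    ability theta answers correctly an item with discrimination a
    and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Two hypothetical Computation items: difficulty (b) varies widely, but
# both load on the same ability (theta), which is what the
# unidimensionality assumption actually requires.
easy = p_correct(theta=0.0, a=1.2, b=-1.0)  # ~0.77
hard = p_correct(theta=0.0, a=1.2, b=1.0)   # ~0.23
```

Varying b across items models diversity of difficulty without introducing a second ability dimension; only content-driven heterogeneity, not difficulty spread, threatens unidimensionality.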

Finally, Recommendation 5 suggests that items of diverse levels of psychometric difficulty be
developed. MCS is proposing to use the subtest scores by combining them with other selection
factors possibly including interview scores, experience scores, and other factors. A consequence of
this composite approach is that the critical range of cognitive test scores becomes much wider. If the
lowest score in this range is defined as the lowest score achieved by any hired applicant and the
highest score is defined as the highest score achieved by any applicant who is not hired, then this
range of scores is the range within which test scores are expected to help discriminate between hired
and not hired applicants. (That is, all applicants who scored below this range are not hired and all
applicants who scored above this range are hired.) In composite systems like the one MCS is
proposing, this critical range of cognitive test scores can become very wide. A professional standard
and principle for personnel selection tests is that scores within the critical range should have adequate
discrimination power. In cases where this critical range is wide, as will be the case in the proposed
system, the optimal test information function should be relatively flat across this range, rather than
peaked in the middle.
peaked in the middle. The recommended process of developing items of diverse difficulty levels will
support that psychometric objective.

Methods of Item and Test Development

Background

We based the following recommendations on three key assumptions:

A. In the initial stage of item and test development, large enough samples will not be available to
estimate IRT parameters for new items. IRT estimates will be gathered over time as the civil
service system becomes operational and provides opportunities for large numbers of
applicants to take large numbers of items.
B. Phase 1 should develop only the subtests needed to predict performance across the full
range of service sector jobs in Saudi Arabia
C. A sufficient number of fixed forms should be developed initially to support an adequate
security strategy until large numbers of items are developed and banked with IRT-estimated
characteristics to produce large numbers of equivalent forms.

Our overall expectation is that the MCS items and tests will initially be developed based on classical
test theory characteristics and then, gradually over time, will accumulate IRT parameter estimates in
order to support a large bank strategy for the production of multiple forms.

Recommendations

Recommendation 6: Establish a long-term objective to develop enough items to support bank-based
forms production to administer a large number of different forms to different applicants.

1. The near-term implication of this bank-oriented strategy is that a sufficient number of items
should be developed initially to support the development of several (4-6) fixed forms for each
subtest, except for the Processing Speed and Accuracy category, assuming that the initial
administration processes will be limited to fixed form delivery, either paper-pencil, online or
both. These fixed forms may be constructed based on classical test theory item
characteristics, to be replaced over time with IRT estimates.
A. If subtests are needed for the Processing Speed and Accuracy category it is very
likely that 2 forms would be adequate.
2. We do not recommend delaying implementation until a large bank of items is sufficient to
support a large number of equivalent forms.
3. We anticipate that it would not be feasible prior to implementation to conduct pilot studies of
hundreds of items large enough to establish IRT item parameter estimates for large banks of
items. It seems more feasible that such large banks of items will only be developed over time
as operational test administration processes provide the opportunity to pilot new items.

Recommendation 7: Place a priority on the initial development of 6-8 subtests, each with several
forms, rather than a larger number of subtests with fewer forms. Two subtests from each of the
Verbal, Quantitative and Reasoning categories would provide sufficient generality to be predictive
across all service sector jobs. We also assume that one or two subtests of Processing Speed and
Accuracy will be appropriate for the clerical/administrative jobs.

Recommendation 8: Develop items using standard professional practices for item writing, review,
empirical evaluation and subtest construction. The following features are singled out for special
attention.

1. Specify construct and content requirements based on Recommendations 1 and 2.
2. Capitalize on sample/practice items drawn from existing batteries, especially Verify and PET
sample items using Tier 2 content.
3. Job materials should be used as content resources for item writers.
a. Item writers should have some exposure to key target jobs, such as a tour of job
facilities, to observe the manner in which job materials represent performance of key
tasks across a range of jobs.
4. Item review procedures
a. Item Expert review (Include experts in the development of personnel selection tests)
b. Statistical differential item functioning analyses (e.g., Mantel-Haenszel) for culturally
salient groups, perhaps male-female?
5. Test construction: Develop several fixed forms (4 or 6) based on Classical Test Theory
equivalency.
a. Develop several practice tests per subtest of similar length and complexity. (Practice
tests do not require empirical equivalency. It can be sufficient to develop practice
tests based on item experts’ judgments of item equivalency based on expected
difficulty and construct overlap.)
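
The Mantel-Haenszel procedure named in step 4b can be sketched briefly. For each item, examinees are stratified by total test score, and a common odds ratio is computed across the strata; all counts below are hypothetical:

```python
def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio for one item.

    Each stratum (examinees grouped by matched total score) is a 2x2 table
    (a, b, c, d): a = reference-group correct, b = reference-group incorrect,
    c = focal-group correct, d = focal-group incorrect.  An odds ratio near
    1.0 suggests the item functions similarly for both groups at matched
    ability levels; large departures flag potential DIF for review."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Hypothetical strata for a male/female comparison on one item:
strata = [
    (30, 20, 28, 22),  # low-scoring stratum
    (45, 15, 44, 16),  # middle stratum
    (55,  5, 54,  6),  # high-scoring stratum
]
odds_ratio = mantel_haenszel_odds_ratio(strata)  # near 1.0: little evidence of DIF
```

In operational use, a chi-square significance test and an effect-size classification would accompany the odds ratio; this sketch shows only the core statistic.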

Rationales for Item and Test Development Recommendations

Recommendation 6

This recommendation is based on the assumption that MCS’s resources for item and test
development will be inadequate initially for IRT-based item and test development. This should not
prevent MCS from developing 4-6 forms for each subtest as the foundation that enables the civil
service system to be implemented. Once implemented, applicant testing should provide adequate
sample sizes to evolve to an IRT-based bank management approach to item maintenance and forms
production.

Our recommendation below about the MCS security strategy is that it should be active and
comprehensive from the beginning, even though unproctored administration is not a likely option.
Having 4-6 equivalent forms for each subtest will enable MCS to (a) administer several different
forms within each paper-pencil test session or cluster of online sessions, (b) announce that different
applicants are receiving different forms of the same tests, and (c) create the likelihood that applicants
who might later talk among themselves will recognize that many of them were given different forms of
the subtests.

Recommendation 7

For purposes of MCS’s Phase 1, 6-8 subtests assessing Verbal Ability, Quantitative Ability,
Reasoning Ability and, as needed, Processing Speed and Accuracy Ability will be sufficient to predict
success in the full range of civil service jobs within the service sector. Validity generalization studies
of cognitive ability tests have shown that 3 or 4 subtests capturing diverse specific abilities will provide
virtually optimal prediction.

Recommendation 8

This recommendation is intended primarily to assure MCS that standard professional practice
regarding the development of items and tests will be sufficient to produce effective subtests. In
addition, Recommendation 8 suggests three steps to better ensure that the development of item
content is well-informed by available information about job content. This is not for the purpose of
creating job knowledge tests or even tests that are benefited by acquired job knowledge. Nor is this
intended to develop subtests designed to measure a specific ability to perform specific job tasks.
Rather, this is intended to ensure that the test taker is satisfied that the civil service test appears to be
relevant to the hiring decision. Note Tier 2 content is not adequate to provide content validity
evidence of an item-job linkage because Tier 2 content is not intended to provide a measure of the
KSAO required to perform a specific job task. Tier 2 content only provides a work-like context for the
test taking experience.

We also call attention to item review procedures, especially empirical procedures such as Mantel-
Haenszel, that may be used to identify items in which group factors appear to add irrelevant variance
to item scores. In MCS’s case, this would be a mechanism primarily for detecting construct-invalid
variance rather than a mechanism for addressing social/legal considerations regarding fairness. At
the same time, legal considerations notwithstanding, professional standards for item and test
development call for an investigation of group-related sources of bias where there is some rationale
for suspecting that there is a risk of such bias.

Finally, this recommendation encourages the development of several practice tests, consistent with
the recommendation below, under support policies, that MCS support test takers’ interest in practice
opportunities. The perspective provided below is that the incremental cost of making practice tests
available is small compared to the benefits of such practice opportunities. Certainly, there is a
goodwill benefit in supporting test takers’ interest in preparing for the tests. But also, practice is
known to provide a modest benefit in expected score, especially for novel item types. A useful way
to minimize differences between applicants in level of practice is to provide easy access to practice
opportunities.

Validation Strategies

Background

The recommendations about validation represent a broad view of the purposes, meaning and value of
validation for MCS’s civil service testing system. In this broad sense, validation could be thought of
as program evaluation. Validation evidence would be evidence that the program is achieving its
objectives. These objectives include the test-oriented objectives of measuring the intended abilities
accurately and reliably and predicting the eventual job performance of new hires, as well as
organizational interests in optimizing the multiple interests that will influence the design of the system
including test quality, user acceptance, cost, timeliness, work performance and consistency with
government objectives for the program.

Our overall perspective is that validation will consist of a sequence of steps, eventually evolving into a
steady-state process of routine data gathering and analysis to monitor the ongoing effectiveness of
the system. (This overall approach is manifest especially in PSI’s and SHL’s support of their
cognitive batteries. They have routine mechanisms for gathering job performance data from clients
based on recent hires for whom test results are available.) We recommend which of these steps
should be completed prior to implementation and which may be undertaken after implementation.

Within this broad view, we also acknowledge the importance of gathering validity evidence for all
components of the selection system itself including interviews, experience assessment and other
possible factors. This also includes components that may be included even though there is no
expectation that they are job relevant. An example might include points added to applicants’
composite scores based on the length of time they have been waiting as fully qualified. The value of
data about these types of components is that it allows an evaluation of their consequences for the
selection decisions.

Recommendations

Recommendation 9: Develop a long-term validation (program evaluation) strategy that evaluates all
selection factors with respect to their intended use and evaluates key program features such as cost
and timeliness with respect to the needs of the hiring organizations, the applicants, and the Saudi
Arabian government. This long-term validation strategy should address a number of considerations
including:

1. The purposes of the key elements of the system relating to:
a. Intended use of job relevant selection factors;
b. Intended use of non-job relevant selection factors;
c. Requirements of hiring organizations including requirements for adequacy of job
performance, turnover rates, speed of position filling, and costs
d. Applicant interests
e. Government interests
2. The sequence in which related information may be gathered. With respect to validation
evidence relevant to the cognitive tests themselves, the sequence should address the
following types of evidence:
a. Ability – Test linkage as described in Study 1;
b. Relationships of newly developed tests to “marker” tests of target abilities
c. Documentation of the validity generalization rationale for the new cognitive ability
tests;
d. The definition and collection of outcome measures for recent hires including job
proficiency measures, training success measures, and turnover status
e. Applicant score distributions for all selection factors for norming purposes and
decisions about optimizing the manner in which selection factors are combined to
inform selection decisions;
f. Job analysis data relating to the purposes of Phase 2 development of job-specific
assessments such as performance assessments
3. Operational metrics, including
a. Cost per hire; per applicant
b. Speed to fill vacancy
c. Hiring manager satisfaction with new hires (Note: Distinguish system metrics from
professional validation measures)
d. Security processes such as data forensics and web patrols

Recommendation 10: For the new cognitive tests, identify the validity evidence that should be
gathered prior to implementation. We suggest that these include:

1. Content evidence relating to the linkage between the target ability domain and the tests as
proposed in Study 1
2. Item development evidence showing lack of measurement bias at the item level
3. Subtest score distribution characteristics in applicant populations
4. Correlational studies of relationships between new subtests and established marker tests
5. Evidence of subtest reliability and inter-subtest relationships
6. Document validity generalization rationale for subtests that addresses type of subtest, job
categories and criterion types
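
For item 5, a common internal-consistency estimate of subtest reliability is Cronbach’s alpha. A minimal sketch with a hypothetical 0/1 response matrix:

```python
import statistics

def cronbach_alpha(responses):
    """Cronbach's alpha for a response matrix (rows = examinees,
    columns = items), a standard internal-consistency estimate of
    subtest reliability: alpha = (k/(k-1)) * (1 - sum(item variances)
    / variance of total scores)."""
    k = len(responses[0])
    item_vars = [statistics.variance([row[j] for row in responses])
                 for j in range(k)]
    total_var = statistics.variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical responses from six examinees on a four-item subtest:
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
alpha = cronbach_alpha(responses)  # ~0.83 for this toy matrix
```

In practice, alpha would be computed per subtest and per form on realistically large samples, alongside the inter-subtest correlations called for in item 5.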

Recommendation 11: For all selection factors including the cognitive subtests, the following
information should be defined, gathered and analyzed following implementation to support a post-
implementation validation strategy.

1. Use job analysis methods to define important dimensions of job behavior (See
Recommendation 12.1 for additional detail.)
2. Establish routine methods for gathering job “performance” data for recent hires
3. Develop IRT estimates of item characteristics; introduce bank management; establish item
retention thresholds and a bank-based forms production strategy
4. Analyze / Monitor relationships between all selection factors and important criterion measures
(proficiency, training success, progression, turnover, etc.)
a. Note: this recommendation is intended to include all selection factors including those
that are not expected to be job relevant so that the effects of the non-job relevant
factors on selection utility may be assessed.
5. Identify potential sources of incremental value for potential Phase 2 job-specific assessments
6. Evaluate the costs-benefits of establishing minimum standards for hiring on the most job
relevant selection components, including the cognitive subtests.
a. Given MCS’s intention to use the cognitive test scores by combining them with other
selection factors, we recommend that MCS evaluate the consequences of this
approach. To the extent that the cognitive tests are one of many selection factors,
their usefulness may be significantly reduced. A possible approach to maintaining a
minimum level of usefulness for cognitive tests would be to establish a low
(presumably) score threshold that applicants must achieve, independent of other
selection factors, to be considered further.
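
The threshold-plus-composite approach described in 6a can be sketched as follows; the weights, score scales, and threshold are all hypothetical placeholders, not recommended values:

```python
def composite_decision(cognitive, interview, experience,
                       weights=(0.5, 0.3, 0.2), min_cognitive=30.0):
    """Illustrative composite selection rule: an applicant must first clear
    a low cognitive threshold; only then is a weighted composite computed
    from all selection factors.  Scores, weights, and the threshold are
    hypothetical values for illustration only."""
    if cognitive < min_cognitive:
        return None  # screened out regardless of other factors
    w_cog, w_int, w_exp = weights
    return w_cog * cognitive + w_int * interview + w_exp * experience

screened = composite_decision(25, 95, 90)   # below threshold: screened out (None)
passed = composite_decision(60, 70, 80)     # 0.5*60 + 0.3*70 + 0.2*80 = 67.0
```

The threshold guarantees a minimum contribution from the cognitive tests even when the other composite factors would otherwise dominate the ranking.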

Recommendation 12: Develop an approach to criterion measurement that is applicable in a
uniform manner across all civil service jobs

1. Analyze jobs to identify important facets of valued work behavior. Do not limit this effort to
just the proficiency and training success factors that are most commonly associated with
cognitive prediction. Design the job analysis process to identify the full range of valued job
behaviors including contextual behaviors such as helping behavior, counterproductive
behavior, and citizenship behaviors such as organization loyalty.
a. Identify broad criteria that may be described at a level of generality to apply across
all civil service job families within the service sector. Potential examples include
“Completes all work accurately” and “Completes all work on time”.
b. Identify job-family specific criteria that are meaningful only for specific jobs or
narrower groups of jobs. Potential examples include “Handles customer interactions
effectively” and “Supervises direct reports effectively”.
2. Design and implement an electronic system (email? Internet?) for routinely gathering
supervisors’ ratings of recent hires’ work behavior.
a. Note: distinguish between this type of standardized data gathered for the purpose of
validation research and process management feedback data such as managers’
“customer satisfaction” ratings for new hires.

Note: the rationales for Recommendations 9-12 are embedded in the description of the
recommendations themselves.

Comments about Criterion Measurement and Predictive Validity Estimation

Recommendation 9 suggests a broad view of MCS’s validation effort. This validation effort would be
both ongoing and broad in scope. The ultimate purpose of this view is to ensure that Saudi Arabia’s
civil service selection system is as effective as possible, not merely to establish that the cognitive
tests predict job performance. For the purposes of this set of comments, we assume that Saudi
Arabia’s civil service system has responsibility for identifying and measuring all selection factors that
will be combined to inform government organizations’ selection decisions. These selection factors
might include the cognitive tests addressed in this Report as well as interview ratings, academic
records, records of previous work accomplishments, work skill measures such as typing skills,
personality assessments, and the like. (We do not include other selection factors unrelated to job
behavior such as the amount of elapsed time an applicant has been “waiting” for a job offer.)

The starting point for this broad validation effort is to identify and define the job behaviors and results /
outcomes that the hiring organizations intend to influence with their selection system. Once identified
and defined, these will guide the development of the criterion measures against which the predictive
validity of the selection factors will be evaluated. Two primary sources are typically used to generate
this list of intended behaviors and results/outcomes – organization leadership/management and/or
some form of job analysis. We encourage MCS to rely on both sources of information about the
important job behaviors and/or results / outcomes.

MCS should consider two broad types of criterion measurement methods: (1) existing administrative
records, and (2) ad hoc assessments designed specifically for the purposes of the validation study.
The latter usually take the form of ratings by supervisors/managers.

Existing Administrative Records

In most cases, organizations create and maintain records of employees’ work behavior and results for
a variety of purposes such as performance management, productivity metrics, appraisal ratings,
development planning, and compensation or promotion decisions. In the case of Saudi Arabia’s civil
service jobs that are primarily in the service sector of work, these might include existing measures
such as appraisal ratings, merit-based compensation decisions, promotion readiness assessments,
training mastery and various work behavior measures, especially in customer service work, such as
the number of customers handled per day and work accuracy.

While the examples above are focused on work behavior and results, we also encourage MCS to
consider other existing records that capture employee outcomes not specific to work tasks but that
have value for the organization. These include outcome data such as safety records, injury
instances/costs, work-affecting health/sickness records, turnover, absence/tardiness records, and
records of counterproductive behavior such as employee theft as well as records of employee
citizenship such as awards or honors.

We also draw special attention to the possibility of training mastery records, given that the MCS civil
service exam system is intended to focus on cognitive ability. It is widely accepted that the major
psychological mechanism by which cognitive ability predicts work behavior is that cognitive ability
enables the effective learning of job knowledge, which is a direct determinant of effective work
behavior. MCS should give special consideration to available records of training mastery that might
be created during employee training events that are intended to enable effective work behavior. If
such records are available and sufficiently free of confounds or irrelevant artifacts, they often
represent the criterion measures that most directly capture the intended benefit of cognitive ability-
based selection.

Of course, the mere fact that an existing administrative record is used by the organization and is
available does not mean that it should be used as a criterion measure for purposes of predictive
validation. Existing records were not developed to serve as criteria in validity studies. As a result,
they may have significant flaws and may not provide a meaningful evaluation of the predictive validity
of a selection procedure. These considerations are addressed below under the headings of
Strengths and Limitations.

Strengths. In general, existing administrative records that are seen as relevant to the work
behaviors associated with the selection procedures being validated have three characteristics that are
typically important for validity study criteria: (1) they are, by definition, important to the organization;
(2) they are well-understood and in the language of managers and supervisors in the organizations;
and, (3) they are available at relatively little cost. A general guiding principle of good professional
practice is that selection procedures be designed to predict work behaviors/outcomes that are
important to the organization. Validation study procedures often address this issue of importance by
gathering ratings of importance for tasks/activities/behaviors/knowledges documented in structured
job analyses. This need to ensure the importance of validity criteria may also be addressed by the
use of existing administrative records.

Certainly, administrative records such as appraisal ratings, promotion-readiness ratings, performance
metrics and training mastery assessments provide a clear indication that the organization attaches
importance to the information they provide for the broader purpose of managing worker performance
and behavior. For instance, if the performance of customer service representatives is managed by
providing feedback and coaching to maximize the number of customers handled per hour, then that
work outcome is, by definition, important to the organization. (Note, this point does not mean that a
particular organizational tactic such as managing the number of customers handled per hour is
necessarily an effective method for optimizing overall organizational results. Sometimes
organizations manage to the wrong metrics. A classic example in customer service work is the
occasional practice of managing to low average customer talk time. Often this performance metric
leads to customer dissatisfaction, which can lead to a loss of customers. However, validity studies
should define criterion importance as the organization defines it, assuming every effort has been
made to ensure the organization understands the meaning of importance as it is relevant to the
question of predictive validity of selection procedures.)

In our view, the purpose of validation is to inform the organization about the effectiveness of its
selection system, not merely to inform researchers about the success of their selection design and
implementation effort. This purpose requires that organization managers and leaders understand
and attach importance to validation results. Our advice is that MCS treat the leaders of the hiring civil
service organizations as the primary beneficiaries of its validation research efforts. Certainly,
validation efforts will also be critical to the developers and implementers of the testing system who
have operational responsibility for its optimization. But, ultimately, validation efforts will be
meaningfully linked to the intended purposes of the selection procedures to the extent organization
leaders understand the evidence and recognize its importance. The use of existing administrative
records – assuming they satisfy other evidentiary requirements – usually helps ensure that the
validation effort reflects actual organizational values and that managers and leaders of the hiring
organizations understand and attach importance to the results.

Validation can be an expensive undertaking both in terms of money and time. Our advice is that
an initial criterion validation effort be conducted that is both affordable and timely. This could
possibly be a concurrent study prior to implementation, although this would not be necessary. This
would be followed by a more programmatic post-implementation criterion validation strategy.
Certainly, the availability of existing administrative records that are meaningful and unconfounded
would greatly facilitate such an early criterion validity study.

Limitations. While existing administrative records can have important strengths, they are often
subject to problems that render them unusable as validation criterion measures. These limitations
typically arise from the fact that they were not developed to serve as validity criterion measures.
They were developed to serve other operational purposes. Three main types of limitations often
plague administrative records: (1) they often have unknown and suspicious psychometric properties;
(2) they can be confounded by other influences; and (3) they may be conceptually irrelevant to the
intended uses of the tests being validated.

A significant limitation with existing administrative records is that, usually, their psychometric
properties are unknown and, in many cases, there is reason to be suspicious of their psychometric
properties. The two primary considerations are their reliability (usually in the sense of stability over

188
time or agreement between raters) and their validity. Unfortunately, in many cases there is no
opportunity to empirically evaluate either the reliability or validity of existing records. An exception
occurs where the same measure, for example a performance appraisal or performance metrics, is
gathered on multiple occasions over time for the same sample of incumbents. In that case, it may be
possible to empirically assess the consistency of such measures over time with a sample of
incumbents. Inconsistency over time can occur for at least two reasons. First, the measure itself
may be unreliable in a psychometric measurement sense. This might be the case, for example, with
untrained supervisors’ ratings of promotion readiness. Second, even if the measurement process
itself is highly reliable, the work behavior may not be consistent over time or between employees.
This is common with many types of sales metrics. For example, a common sales management
practice is to give the best sellers the most important and largest sales opportunities. This can have
dramatic consequences for sales results. Also, seasonal sales trends can impact differences
between sales people’s results. As a result, even if the measurement of sales results is highly
reliable, the results themselves can be inconsistent over time and across sellers. In this case, it is important
to understand that a highly reliable measure of inconsistent behavior/results makes a poor validity
criterion.

In most cases, empirical assessments of the psychometric properties of existing records will not be
possible. In those cases, expert judgment must be relied on to evaluate whether the existing records
are likely to have the necessary psychometric properties.

Perhaps the most common limitation of existing records is that they are confounded by other
influences that add irrelevant / invalid variance to the measures; as a result, they become unusable.
Common examples include appraisal ratings that may be artificially constrained by forced distribution
requirements or by supervisors’ strategies of evening appraisals out over time. Another example is
the use of locally developed knowledge tests as measures of training mastery. Often, such tests are
not designed to discriminate among trainees but, rather, to provide trainees the opportunity to
demonstrate their mastery of the training content. This can lead to ceiling effects in which a high
percentage of trainees receive very high scores, thus greatly restricting the range of scores.
Similarly, training mastery tests are sometimes used as a mechanism for feedback about acquired
knowledge. In that case, the process of completing the tests might include feedback about correct
and incorrect answers which can lead, eventually, to high scores such that a very high percentage of
trainees “successfully” complete training.
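
Where score-level records are available, ceiling effects of this kind can be screened for directly before a mastery test is accepted as a criterion. The sketch below (scores are hypothetical) computes the two warning signs discussed above: the share of trainees at or near the maximum score and the (restricted) standard deviation.

```python
def ceiling_check(scores, max_score, threshold=0.95):
    """Return the share of examinees at/near the ceiling and the score SD."""
    n = len(scores)
    near_ceiling = sum(1 for s in scores if s >= threshold * max_score) / n
    mean = sum(scores) / n
    sd = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    return near_ceiling, sd

# Hypothetical training-mastery scores on a 100-point test.
scores = [98, 100, 97, 99, 100, 96, 100, 98, 95, 99]
share, sd = ceiling_check(scores, 100)
# Every trainee scores within 5% of the maximum and the SD is very small,
# so this test would make a poor validity criterion.
```
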

In our experience, there is rarely any opportunity to empirically evaluate the extent to which existing
records are confounded. Expert judgment is almost always necessary to determine whether the
degree of confounding invalidates the measure as a possible validity criterion.

Finally, an important consideration is whether an available record of work behavior is relevant to the
intended benefit of the selection procedure in question. This can be a significant consideration with
regard to the validation of cognitive ability tests used for selection purposes. Given that cognitive
ability is understood to influence job behavior by facilitating the learning of job knowledge, records of
work behavior that are unlikely to be influenced by job knowledge should be regarded as irrelevant to
the validation of cognitive selection tests. For example, turnover is a frequent work behavior of great
importance to organizations. But job knowledge and, therefore, cognitive ability may or may not be
relevant to turnover. In contexts in which turnover is largely driven by the attractiveness of available
alternatives in the employment market, for example better paying jobs doing similar work in a more
convenient location, job knowledge and cognitive ability may not predict turnover. In contrast, in
contexts in which job performance is a large determiner of turnover, job knowledge and cognitive
ability may be significant predictors of turnover. Another example often occurs in customer service
jobs in which adherence to customer interface protocols is required in a high volume of repetitive
customer interactions. Successful performance of this type of task may depend far more on
sociability and emotional resilience than on easily learned job knowledge. Cognitive ability tests
would not be expected to predict such performance because the specific role of cognitive ability in
performance would not be relevant to this particular form of job performance.

This point distinguishes between invalidity and irrelevance. Criterion measures that are properly
understood to be irrelevant to the construct/meaning of a cognitive ability test should not be used as
criteria in the validation of that cognitive ability test. Validity studies should be designed to provide
evidence of the extent to which the selection procedures are achieving their intended purpose. That
is, irrelevant work measures do not provide evidence of validity.

(It is important to note here that this point about the relevance of criterion measures is not inconsistent
with Recommendation 9 above to view the overall validation effort in the broad context of program
evaluation. We are assuming that MCS will have responsibility for more than just the cognitive ability
tests used within the civil service employment system. We assume MCS will also have some degree
of responsibility for the other factors in the selection system. This broader view in Recommendation
9 means, among other things, that the design of the validation strategy should take into account all
selection factors, not just the validity of the cognitive ability tests.)

All of the above issues considered, we encourage MCS to give careful consideration to existing
administrative records as a source of criterion measures. In our experience, certain types of
administrative records are often too confounded / unreliable to be useful. These include: (a)
performance appraisal ratings by untrained supervisors where those ratings influence compensation
decisions such as raises and bonuses (the linkage to compensation often introduces significant
irrelevant considerations); (b) training mastery tests that are so “easy” the large majority of trainees
achieve very high scores; (c) sales records where there are significant and frequently changing
differences between sellers in their opportunity to sell; and (d) peer ratings as might be used for 360
feedback purposes in which peers have a potential conflict of interest due to the use of 360 ratings to
help make promotion decisions. In contrast, other existing administrative measures often have
enough relevance and (judged) reliability to be adequate criterion measures. These include: (a)
performance metrics in customer service jobs, especially call centers where the performance metrics
are automatically gathered and most employees are working under common conditions; (b) records of
completed sales or “qualified” callers in an online sales environment where the sales roles and
responsibilities are relatively uniform across all sales representatives; and (c) attendance records of
tardiness and absence including days missed due to work-related illness or injury.

Ad Hoc Assessments

In most cases where existing administrative measures are useable as criteria, supplemental ad hoc
criterion measures can add important value. Often, the benefit of ad hoc measures is to complement
existing measures to avoid criterion deficiency and to provide a method for assessing some form of
overall performance or value to the organization. Examples of common ad hoc assessments include:

 Ratings of overall performance and/or contribution to the organization
 Ratings of specific facets of overall performance such as customer interface effectiveness,
work accuracy, work efficiency, and team participation
 Ratings of work-related skills and knowledge
 Ratings of potential or readiness for progression and promotion
 Ratings of citizenship behaviors such as helping behaviors and loyalty behaviors
 Ratings of counterproductive behaviors such as non-collaborative behaviors, behavior
antagonistic to the goals of the job or organization, theft, malingering, etc.
 Scores on ad hoc job knowledge tests
 Scores on ad hoc high fidelity work sample tasks

Ad hoc assessments are virtually always specific to the particular job. A common source of job
information used to specify the ad hoc assessments is some form of structured job analysis that seeks
to identify the most important skills, outputs, and tasks/activities based on incumbent and supervisor
ratings of many specific job tasks/activities/abilities. An alternate, often less costly approach is to use
existing job descriptive material that identifies major duties, responsibilities, tasks/activities as a
source of broad information about the job, followed by focus group discussions with job experts that
specify the particular job outputs, tasks, behaviors associated with the documented job information.
Virtually always, the source of job information used to identify and specify ad hoc criteria is job expert
judgment elicited in some structured fashion whether in the form of job analysis surveys, focus group
outputs or individual expert interviews. In effect, this expert source is the most common method of
ensuring the validity or relevance of the ad hoc criterion.

The most common measurement method used to assess ad hoc criteria is ratings by supervisors and
managers. Woehr and Roch (2012) provide an excellent summary of recent research and discussion
about supervisory ratings of performance. Also, Viswesvaran (2001) provides an excellent summary
of the broader issues of performance measurement. Both acknowledge the usefulness and
professional acceptability of supervisory ratings while at the same time providing information about
their limitations.

Assuming that job expertise was appropriately used to specify relevant content for such rating scales,
the primary concern about supervisory ratings is their relatively low reliability. Although there are
different aspects of reliability (internal consistency, stability over time, etc.) and different methods for
estimating reliability (parallel forms, test-retest, split-half, internal consistency, inter-rater reliability,
etc.) a commonly used estimate of supervisory rating reliability is .62. This estimated value is
frequently used in meta-analyses of validities based on supervisors' ratings where no reliability
evidence is presented in the particular studies. While this level of reliability is marginal and would not
be considered adequate for measures to be used as the sole basis for making employment selection
decisions, it is commonly accepted professionally given the purposes supervisory ratings typically
support. Typically, supervisory ratings are one source of validity information used to provide
confidence about the appropriateness of a particular selection procedure.

Given this marginal level of reliability, it is important to develop supervisory rating scales following
professionally accepted methods that will maximize the level of reliability. Three major elements of
the development process are important to ensure maximum reliability (and validity): (1) adequate job
expertise; (2) careful specification of rating scales; and (3) adequate rater training.

Adequate Job Expertise. This requirement manifests itself in two ways. First, the process of
identifying and specifying ad hoc criteria almost always relies on the judgments of job experts about
the facets of performance and work behavior that are important enough to include as validation
criteria. Second, the raters who assess the performance and behavior of the participating employees
should be job experts and sufficiently familiar with the target employee to make accurate ratings. A
common threshold for familiarity is 3-6 months supervising the employee. Also, the supervision
should include adequate exposure to the employee’s work behavior and performance. While this can
be difficult to judge in some cases, the rater should be willing to endorse his or her own familiarity with the
target employee.

Careful Specification of the Rating Scales. Generally, well-constructed rating scales will include
behavioral anchors (descriptions) for at least the high and low ends of the rating scale. These
anchors are descriptions of the types of behavior and/or performance levels that exemplify the
meaning of the rating level associated with the anchors. These anchors are typically developed
following structured processes involving job experts who are able to generate descriptions of high and
low levels of behavior/performance. Also, a common practice is to use two stages of development of
anchors by having guided experts develop draft anchors in one stage and then have a separate group
of equally expert judges re-sort the draft anchors into their original scales and levels. Anchors that
Stage 2 experts associate with different scales and/or different levels would be removed.

Adequate Rater Training. There is a clear professional consensus that rater training is important to
the validity and reliability of supervisory ratings. At a minimum this training describes the meaning of
the rating scales and their anchors and a clear description of common rater errors such as halo
effects, leniency, and central tendency. Better training would include opportunities for multiple raters
to observe the same performance / behavior samples, make ratings and then compare their ratings
with feedback and discussion about the relationship between the observed behaviors / performance
examples and the meaning of the rating scales.

The Psychometrics of Criterion Measures and the Estimation of Operational Validity


Like other personnel selection professionals who design and conduct validation studies using
supervisory ratings, Sabbarah / MCS may wonder what standards are applied to determine whether
specific rating scales are good enough to include in a study. While professional standards and
practice do not provide precise guidance about this, a professionally reasonable guideline would be
based on the three development elements described above. To the extent these three elements
were meaningfully applied in the development of a rating scale, Sabbarah / MCS could be confident
that the rating scale is relevant (i.e. valid) and psychometrically adequate (i.e. reliable) enough to
include as a criterion measure in a validity study. To the extent that any one of these elements is
missing from the development process, the risk increases that the rating will not add useful
information.

An additional tactic for protecting against inadequate individual scales is to factor analyze the multiple
scales used to gather ad hoc assessments, identify the small number of meaningful factors and
create factor-based composite scales by combining the individual scales loading on the same factors
into composite measures. This tactic requires that several rating scales be developed to sample or
represent the criterion “space” captured in the expert-based analysis of job performance and
behavior. Of course, a non-empirical approach to the same tactic is to rationally identify the major
criterion space factors prior to the validation study, using well-instructed job experts. This would
enable the criterion composites to be formed prior to data gathering. The advantage of the factor
analysis approach is that it provides some empirical evaluation of the measurement model underlying
the rating scales.
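
The factor-based composite tactic can be illustrated with a minimal sketch: scales assigned to the same factor are averaged into one composite. The scale names, the two-factor structure, and the ratings below are hypothetical; in practice the assignment would come from the factor analysis (or the rational sort) just described.

```python
# One supervisor's ratings of one employee on a 1-5 scale (hypothetical).
ratings = {
    "work_accuracy": 4, "work_efficiency": 5, "job_knowledge": 4,
    "customer_interface": 3, "team_participation": 4, "helping_behavior": 5,
}

# Hypothetical factor assignment from a prior factor analysis.
factors = {
    "task_performance": ["work_accuracy", "work_efficiency", "job_knowledge"],
    "contextual_performance": ["customer_interface", "team_participation",
                               "helping_behavior"],
}

def composite_scores(ratings, factors):
    """Average the individual scales loading on each factor into a composite."""
    return {name: sum(ratings[s] for s in scales) / len(scales)
            for name, scales in factors.items()}

composites = composite_scores(ratings, factors)
```
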

In general, criterion unreliability is well understood to be an artifact that negatively biases validity
estimates. That is, all else the same, less reliable criteria produce lower observed validities than
more reliable criteria. Further, any unreliability in a criterion measure results in a validity estimate
that is lower than the level of validity representing the relationship between the selection procedure as
used operationally and the actual criterion as experienced and valued by the organization. That is,
the organization benefits from actual employee behavior and performance, not imperfectly measured
behavior and performance. For this reason, the validity estimate that best represents the real
relationship between the selection procedure as used and the criterion as it impacts the organization
is an observed validity that has been corrected for all of the unreliability in the criterion measure.

As a standard practice, Sabbarah / MCS should correct observed validities for the unreliability of the
criterion measures to produce a more accurate index of the operational validity of a selection
procedure. (This is the well-known correction for attenuation described in many psychometrics texts.
The corrected validity is defined as the observed validity divided by the square root of the observed
criterion reliability index.) By this method of correction, the psychometric bias introduced by any
imperfectly reliable criterion, including marginally reliable supervisory ratings, can be eliminated from
the estimation of the predictive validity.

However, one other artifact in a common type of predictive validity study introduces a separate
negative bias into observed validities. Range restriction in the predictor (and criterion) occurs when
the sample of employees for whom criterion data is available has a narrower range of predictor scores
than the applicant population within which selection decisions are made based on the selection test.
This restriction occurs in validity studies where the test being validated is used to make the selection
decisions that determine which applicants become employees whose work performance measures
are eventually used as the criteria. (Validity studies that administer the test to applicants but do not
use the test scores to make selection decisions may avoid all or most of this range restriction.
Unfortunately, such studies are unusual because they are significantly more costly: they
require the employer to hire low-scoring applicants who are likely to be lower performing employees.)

Like the negative bias due to criterion unreliability, the negative bias due to range restriction results in
the observed validity coefficient underestimating the actual, or operational, validity of the test.
Because the employer benefits from avoiding the low performance levels of rejected applicants, the
validity coefficient that most accurately represents the effective validity of a test is one that has been
corrected for range restriction, as well as criterion unreliability. This corrected, more accurate validity
is known as the operational validity of a selection procedure and is the validity of interest to an
organization. This is why Sabbarah / MCS should correct all observed predictive validities for both
criterion unreliability and for range restriction. These corrections and the underlying psychometrics
justifying these corrections are described in many sources including Hunter and Schmidt (1990).
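
Applying both corrections in sequence yields the operational validity estimate. The sketch below uses the Thorndike Case II formula for direct range restriction (described in sources such as Hunter and Schmidt, 1990); all input values are hypothetical, and the order shown (criterion unreliability first, then range restriction) follows common practice.

```python
def correct_for_range_restriction(r, sd_applicants, sd_incumbents):
    """Thorndike Case II correction for direct range restriction on the predictor."""
    u = sd_applicants / sd_incumbents  # unrestricted / restricted SD ratio
    return (r * u) / (1 + r ** 2 * (u ** 2 - 1)) ** 0.5

def operational_validity(observed_r, criterion_reliability,
                         sd_applicants, sd_incumbents):
    """Correct an observed validity for criterion unreliability, then for
    range restriction."""
    r = observed_r / criterion_reliability ** 0.5  # correction for attenuation
    return correct_for_range_restriction(r, sd_applicants, sd_incumbents)

# Hypothetical values: observed r = .20, criterion reliability = .62,
# applicant SD = 6.0, incumbent SD = 4.0 (so u = 1.5).
estimate = operational_validity(0.20, 0.62, 6.0, 4.0)
```

Because both artifacts bias the observed coefficient downward, the corrected estimate (about .37 here, versus the observed .20) better reflects the validity the organization actually experiences.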

Overall, the typically marginal levels of reliability with supervisory ratings should be taken into account
in two separate ways.

1. The development of such ad hoc measures should include three key elements relating to the
use of job expertise, scale construction, and rater training to minimize the risk of inadequate
reliability.

2. The empirical estimation of operational validity, which is the validity index most important to
the hiring organizations, should correct for the psychometric effects of unreliability in any
criterion measure, including those with marginal reliability.

(Note, we have addressed the problem of marginally reliable supervisory ratings in the context of ad
hoc criteria developed for the purposes of the validity study. However, exactly the same problem
exists, and likely more severely in many cases, with existing administrative measures such as
performance appraisal ratings that also rely on supervisory ratings.)

Recommendations about Content Validity

Background

Given Sabbarah’s strong interest in the role of content validation evidence, these recommendations
were developed to help clarify two primary points relating to content-oriented evidence for cognitive
ability selection tests.

A. Professional standards and principles about content validity for selection tests focus on the
nature of content evidence linking test content to job content. This linkage is necessary to
infer from content evidence that test scores will predict job performance.
B. Even the recent arguments that job-oriented content evidence may be available for cognitive
ability tests (e.g., Stelly & Goldstein, 2007; Schmidt, 2012) acknowledge that content
evidence is not relevant to tests of broad cognitive abilities that are not specific to identifiable
job tasks and/or KSAOs. Our view is that the Phase 1 cognitive subtests will be like this.
Content evidence will not be available for the test-job linkage.

This set of recommendations and the rationale provided are informed by four primary considerations:

A. The Study 1 proposal regarding the role of content validity for item and test development;
B. The Study 2 findings regarding the absence of content validity evidence for any of the seven
reviewed cognitive ability batteries;
C. Professional guidance regarding content validity evidence for employment tests; and,
D. MCS’s proposal to stage the development of the civil service exam system such that the
focus of Phase 1 will be on a set of cognitive abilities that are applicable across the full range
of civil service jobs and a later Phase 2 will focus on more job-specific testing such as
performance assessments.

Recommendation 13: Acknowledge the difference between job-oriented content validity evidence
and ability-oriented content validity evidence. These are different types of evidence and support
different inferences about test scores.

Recommendation 14: Where content evidence is intended to support the inference that test scores
predict job performance, apply the definition and procedures of job-oriented content validity that have
been established for personnel selection tests in the relevant professional standards (i.e., the AERA,
APA, and NCME Standards for Educational and Psychological Testing (Standards) and the SIOP Principles
for the Validation and Use of Personnel Selection Procedures (Principles)). Phase 1 tests are unlikely
to provide job-oriented content evidence, whereas job-oriented content evidence for Phase 2 tests
may be important and efficient to obtain.

Study 1 Proposal

Study 1 proposes a different type of content validation evidence than is described in the Standards
and Principles for employment tests.

Perhaps the most critical distinction between the content validity evidence described in Study 1 and
content validity evidence described in professional standards relating to employment selection is the
domain of interest. Both approaches to content validity begin with the requirement that the domain of
interest be clearly specified. In both cases, the domain of interest refers to the domain of
abilities/activities/behaviors to which test scores are linked by content-oriented evidence. In the case
of cognitive employment tests developed for MCS’s civil service tests, Study 1 proposes that the
domain of interest is the CHC theoretical framework, which provides a taxonomy of cognitive abilities.
“The first layer (of the recommended ECD process) is domain analysis. Here, the test developer
specifies lists of concepts according to the Cattell-Horn-Carroll (CHC) intelligence theory, which Keith
& Reynolds (2010) believe serves as “a ‘Rosetta Stone’…for understanding and classifying cognitive
abilities.”” (pg. 5) In clear contrast, the Principles for the Validation and Use of Personnel Selection
Procedures (Society for Industrial and Organizational Psychology, 2003) describes this first step of a
content validation process as, “The characterization of the work domain should be based on accurate
and thorough information about the work including analysis of work behaviors and activities,
responsibilities of the job incumbents, and/or the KSAOs prerequisite to effective performance on the
job.” (pg. 22).

Essentially, the Study 1 content validity methodology, which derives from educational assessment,
describes the domain of interest as the abilities the items and tests are designed to measure,
whereas professional principles for content validity of employment tests describe the domain of
interest as the work activities/behaviors the items and tests are designed to predict. The reason for
the difference follows directly from the difference in the purposes of the two types of ability
assessment. The typical purpose of educational assessments is to assess achievement or capability
or some other characteristic of the individual’s level of the target skill/ability/knowledge. This requires
that item and test scores are interpretable as accurate measures of the target skill/ability/knowledge.
In contrast, the universal purpose of employment tests is to inform personnel decisions, which
requires that item and test scores are interpretable as accurate predictors of work performance.

These are two forms of content evidence that inform different inferences about item and test scores
necessary to support different purposes of the tests. ECD-based content validity informs
measurement inferences because it is ability oriented; employment test content validity informs
prediction inferences because it is job oriented. The development of employment tests will benefit
from both types of content evidence because employment tests should not only predict work behavior
but should also be valid measures of the abilities they are designed to assess.

Findings from the Review of Seven Batteries

None of the seven reviewed batteries is supported with content validity evidence. Two primary
reasons for this, presumably, are (a) US regulations governing matters of employment discrimination
view content validity evidence as irrelevant to cognitive ability tests, and (b) it is widely agreed that
content-oriented evidence is irrelevant to certain types of commonly used general, broad cognitive
ability tests because their content does not lend itself to a comparison to job content and they are often not
designed explicitly to represent important elements in the job domain. For example, abstract
reasoning tests do not lend themselves, usually, to content-oriented validity evidence because they
are not developed to sample any particular work behavior and their content is not comparable to
typical descriptions of job content.

This lack of content evidence is not a criticism of the likely validity of the seven batteries or of the
developers’ professional practices. It is simply an acknowledgement of the widely held view that
certain types of commonly used cognitive ability tests were not developed to sample job content and
do not have the specificity of content to enable content evidence to be gathered. In our professional
experience this finding about the lack of reliance on content validity is typical of commercially
available cognitive ability employment tests. This point will have important implications for MCS’s
overall validation strategy as captured in Recommendation C.

Further, this lack of content validity evidence should not be interpreted to mean that developers or the
personnel selection profession in general have little interest in other sources of validity evidence than
predictive validity evidence. Almost all of the reviewed batteries are supported by some form of
construct validity evidence, most often in the form of correlations with separate tests of similar
abilities. In addition, although we were unable to gather detailed information about item and test
development practices in many cases, professional guidance for employment tests calls for a variety
of development procedures intended to improve the measurement validity of items and tests,
including psychometric analyses of differential item functioning to reduce group-based sources of
invalid variance (bias) and reviews of newly developed items by cultural experts to identify potentially
biasing content. In addition, a common practice is the use of pilot studies to evaluate the
psychometric characteristics of new items to screen out underperforming items, including items that
appear to be measuring something different than other items designed to measure the same abilities.

Professional Guidance about Job-Oriented Content Validity for Employment Tests

Professional guidance about employment tests describes five major processes necessary to establish
content evidence of validity.

1. An appropriate job analysis foundation that provides credible information about the job
2. A clearly defined job content domain describing the important work behaviors, activities,
and/or worker KSAOs necessary for effective job performance
3. Appropriate procedures for establishing key linkages (a) between the job analysis and the job
content domain, (b) between the job content domain and the test development specifications,
and (c) between the test development specifications and the item content.
4. Appropriate qualifications of the subject matter experts who will make judgments, such as
importance ratings, about the key linkages
5. Adequate documentation of all methods and procedures

The Study 1 effort, which apparently intends to incorporate certain aspects of job-oriented content
evidence, is not adequate to satisfy these professional standards. The most significant gaps are (a) the
lack of a well-specified job content domain, (b) the absence of job content domain considerations in the
development of the test plan specifications and item content, and (c) an inadequate linkage between
item content and the job content domain. Study 1 proposes that each item be rated by job experts
for its importance “for use in civil service”. But this reference to “civil service” as the domain of
interest for importance ratings is not sufficient to link items to the “most important work behaviors,
activities, and / or worker KSAO’s necessary for performance on the job…” as prescribed by SIOP’s
Principles.

MCS’s Staged Approach to Test Development

MCS has established a plan to develop and use a set of general cognitive ability tests during the first
phase of its civil service system that will be used to inform hiring decisions for all civil service jobs.
Our recommendation (see below) is that these tests not have job-specific content and not be
designed to sample important job content. MCS’s plan is to introduce job-specific tests, such as
performance tests and, possibly, tests of job-specific skills/abilities/knowledge in a second phase of
the civil service system.

This distinction between Phase 1 and Phase 2 tests is critical to the relevance of job-oriented content
validity evidence. Our recommendation (see below) is that Phase 1 tests be similar in many respects
to the most common types of tests represented in the reviewed batteries capturing cognitive abilities
in the verbal, quantitative and problem solving/reasoning domains. We also recommend that Tier 2
content be built into the items in these tests but that none of these tests be developed explicitly to
sample important job content. If MCS proceeds in this recommended manner, job-oriented content
validity evidence will not be relevant because item/test content was not developed to sample job
content.

Rationale for Recommendations

Recommendation 13: Acknowledge the difference between job-oriented content validity evidence
and ability-oriented content validity evidence. These are different types of evidence and support
different inferences about test scores. Both types of evidence can be valuable.

The ECD-based content evidence described in Study 1 is evidence about the linkage between items,
tests and the abilities they are intended to measure. The value of this evidence is that it allows
scores to be interpreted as measures of the intended abilities. In contrast, the content evidence
described in professional standards for employment tests is about the linkage between items, tests
and the job content they are intended to sample. The value of this evidence is that it allows scores to
be interpreted as predictive of job performance. Recommendation 13 asserts that these are separate
types of evidence and both have value. However, ability-oriented content evidence does not, itself,
imply that scores are predictive of job performance. By being oriented toward a theoretical model of
abilities that have been identified independent of their relationship to job performance, this ability-
oriented evidence cannot directly establish a link between scores and job performance. Further, item
importance ratings that refer to importance for “use in civil service” are not sufficient to establish this
score-job linkage because “use in civil service” does not specify the behaviors/activities/KSAOs
important for effective performance.

This recommendation supports the general concept of ability-oriented content evidence within the
ECD process model but cautions that it does not substitute for job-oriented content evidence.

As an aside to this recommendation, we note here that reliance on the ability-oriented ECD process
as described in Study 1 is likely to lead to the identification of many more ability constructs and the
development of a potentially much larger number of subtests than selection research has shown are
needed to achieve maximum predictive validity. Our broad suggestion is that the initial domain
specification in the ECD approach should include considerations of job content in the service sector
and previous relevant predictive validity research results, as well as a theoretical model of cognitive
abilities. Such an approach to domain specification that is more job oriented is likely to yield a much
more efficient, lower cost item and test development effort.

Recommendation 14: Where content evidence is intended to support the inference that test scores
predict job performance, apply the definition and procedures of job-oriented content validity that have
been established for personnel selection tests in the relevant professional standards (i.e., the AERA,
APA, and NCME Standards for Educational and Psychological Testing (Standards), and the SIOP
Principles for the Validation and Use of Personnel Selection Procedures (Principles)). Phase 1 tests are unlikely
to provide job-oriented content evidence whereas job-oriented content evidence for Phase 2 tests
may be important and efficient to obtain.

We anticipate that the tests developed for use in Phase 1 are unlikely to be designed to be samples of
important job content. Rather, they are likely to be somewhat broad, general ability tests relying on
Tier 1 (work-neutral) and/or Tier 2 (irrelevant work context), which does not enable job-oriented
content evidence to be gathered. In contrast, it appears that MCS’s approach to test development in
Phase 2 will emphasize job-specific tests. While these tests might take several different forms (e.g.,
performance tests, situational judgment tests, job knowledge tests, and cognitive ability tests requiring
job-specific learning (Tier 3)), these types of tests designed to sample job-specific content are likely to
enable the collection of job-oriented content validity evidence. For these types of cognitive tests,
content validity is likely to be an important and efficient form of evidence. In this case, the
professional standards relating to content validity for employment tests should be applied to the
validation process.

The Role of Content Validity in the Development of GATB Forms E and F

In an effort to more completely address the possible role of content validity evidence in the
development of GCAT cognitive ability tests, we more closely reviewed the processes used to
develop GATB Forms E and F to describe and evaluate the role of content validity processes and
evidence in that development effort. We reviewed Mellon, Daggett, MacManus and Moritsch (1996),
Development of General Aptitude Test Battery (GATB) Forms E and F, which is a detailed, 62-page
summary of the presumably longer technical report by the same authors of the same GATB Forms E
and F development effort. The longer technical report could not be located. However, the summary
we reviewed provided sufficient detail about item content review processes that we are able to
confidently report about the nature of that process and its possible relationship to content validity. In
particular, pgs. 11-15 addressed “Development and Review of New Items” and provided detailed
information about the nature of the expert review of the newly drafted items.
Our focus was on the item-level review process used to evaluate the fit or consistency between item
content and the meaning of the abilities targeted by the newly developed items. Several key features
of this process are described here.

1. All new items were reviewed by experts, which led to revisions to the content of specific items
in all seven subtests.
2. The focus of these reviews, however, was not primarily on the fit or linkage between item
content and the target aptitudes, but instead was on possible sources of item bias relating to
race, ethnicity and gender. However, this “item sensitivity” review was closely related to an
item validity review because sources of race, ethnic, and gender bias were regarded as
forms of “bias”, that is, as sources of invalidity.
3. A highly structured set of procedures was developed including bias guidelines, procedural
details, a list of characteristics of unbiased items, and rating questions.
4. A key feature of the review process was a set of standardized “review questions” that
reviewers were to answer for each item using a prescribed answer form.
5. Although this review was focused primarily on item content that was plausibly sensitive to
race, ethnic and gender differences, reviewers also provided feedback about other item
quality characteristics such as distractor quality, clarity of test instructions, susceptibility to
testwise test takers, unnecessary complexity, appropriateness of reading level, excessive
difficulty, and sources of unnecessary confusion.
6. The panel of seven reviewers consisted of two African-Americans, three Hispanics, two
whites, three males and four females. All had relevant professional experience including
three personnel analysts, two university faculty members in counselor education, one
personnel consultant, and a post-doctoral fellow in economics.
7. After training, the reviews appear to have been completed individually by the reviewers.
8. The output of the review panel was used to revise items, instructions and formats.
9. It does not appear that revised items were re-reviewed, nor does it appear that the reviewers’
output was aggregated into an overall summary of the extent of bias in the items within
subtests.

The developers/authors do not describe this process as a content validation process although it
appears to have important similarities to the concept of item content review described in Study 1 in
that the bias factors appear to have been described as sources of construct-irrelevant variance.
However, this review was considerably narrower in scope than the process envisioned by Study 1.

A second point worth noting is that this process is not unlike similar processes frequently used by
developers of personnel selection tests to minimize the possibility of group-related sources of bias
and invalidity. However, in the personnel selection domain and in related standards and principles,
this type of review is not described as a content validation process. Rather it is described by various
terms including bias review and sensitivity review to indicate that its primary focus is on group-related
bias reduction. Semantics notwithstanding, this exercise is directly relevant to the matter of construct
validity.

Certainly, we encourage Sabbarah / MCS to undertake this type of bias or sensitivity review, although
the focus may be less on race/ethnic group differences than on other possible group differences
(male – female?) that, in Saudi Arabia, could be indications of construct-irrelevant variance.
Sabbarah /MCS may want to consider broadening the scope of the review to include any possible
sources of construct-irrelevant variance. In that case, this review would presumably be very similar
to the type of review envisioned by Study 1 and labeled as content validity. If the scope of the review
were broadened, Sabbarah / MCS would need to broaden the range of expertise represented among
the reviewers. In addition to having reviewers with expertise relating to race, ethnic and gender
group differences in test performance, other reviewers would be needed whose expertise was in the
meaning and measurement of cognitive ability.

One final observation concerns the comparability of the GATB Forms E and F development process
and the GCAT development process. The GATB Forms E and F development process was described to
item developers as a process of developing new items that should be parallel to existing items in the
previous forms. This may have even included item-by-item parallelism where a new item was
developed to be parallel to a specific existing item. In any case, the item specifications and writing
instructions guiding item development for Forms E and F clearly focused on the requirements of item-
level parallelism. This presumably meant that there was less focus on precise definitions of the
target aptitudes. In contrast, the item specifications and writing instructions for GCAT item
development will focus on the clear description of the meaning of the target abilities and the work-like
tasks (Tier 2 content) that are presumed to require them. These same GCAT item specifications and
writing instructions would presumably provide the descriptions of the target abilities necessary to
enable reviewers of item content to make judgments about construct-irrelevant variance. One
implication of this feature of the GCAT item development is that the item specifications themselves,
and the processes used to (a) adhere to those specifications during item writing and (b) confirm that
the specifications were met in drafted items, would, once carefully documented, themselves constitute
content-oriented validity (or construct validity) evidence.

ORGANIZATIONAL AND OPERATIONAL CONSIDERATIONS

Security Strategy
Background

Our perspective about the important question of MCS’s security strategy is grounded in three
observations. First, even though MCS is unlikely to implement an unproctored administration option,
the protection against cheating and piracy and the effort to detect violations of test security remain
critical requirements for the maintenance of a successful civil service testing program (Drasgow, Nye,
Guo & Tay, 2009). We take this perspective while acknowledging that we have no experience with
the environment in Saudi Arabia with respect to this issue. Second, we believe a model test security
program has been established by SHL in support of Verify, which is administered unproctored, online.
While certain features of SHL’s strategy are not relevant to MCS’s proposed system of proctored test
administration, many could be. Finally, MCS’s civil service testing system is a resource to
government agencies and citizens of Saudi Arabia who will rely on it to be effective and fair. MCS
has a responsibility as a government agency to ensure the effectiveness and fairness of this system,
which is a somewhat different responsibility than the commercial interests of publishers of the
reviewed batteries.

Recommendations

Recommendation 15: Implement a comprehensive test security strategy modeled after SHL’s
strategy for supporting its Verify battery. This approach should be adapted to MCS’s proctored
administration strategy and include the following components.

1. In the early stages of the program, prior to having the online capability to construct nearly
unique online forms for each test taker, use several equivalent forms of the subtests in both
paper-pencil and online modes of administration. Do not rely on only two fixed forms.
2. Refresh item banks routinely over time with the early objective of building large item banks to
eventually enable nearly unique online forms for every test taker.
3. Monitor item characteristics as applicant data accumulates over time.
4. Use data forensic algorithms for auditing applicants’ answers for indications of cheating
and/or piracy. At the beginning of the program, evaluate the option of entering a partnership
with Caveon to support this form of response checking.
a. To support this approach, measure item-level response time with online
administration.
b. Establish a clear policy about steps to be taken in the case of suspected cheating.
5. Establish a technology based practice of conducting routine web patrols of Internet sites for
indications of collaboration, cheating or piracy with respect to the civil service tests.
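
The kind of data-forensic auditing referred to in component 4 can be illustrated with a deliberately simple sketch. The function, field names, and thresholds below are entirely hypothetical and are far cruder than the proprietary models a vendor such as Caveon would apply; the sketch only shows why component 4(a) calls for capturing item-level response time, since implausibly fast correct answers are a common indicator of item pre-knowledge.

```python
# Illustrative sketch only: flag test sessions where a large share of
# correct answers were given implausibly fast. All names and thresholds
# are hypothetical; operational forensics use far richer models.

def flag_suspect_sessions(sessions, min_seconds=3.0, max_fast_correct=0.30):
    """sessions maps a session id to a list of (correct, seconds) pairs.
    Returns the ids whose proportion of very fast correct answers
    exceeds the chosen threshold."""
    flagged = []
    for session_id, responses in sessions.items():
        correct_times = [t for ok, t in responses if ok]
        if not correct_times:
            continue
        fast = sum(1 for t in correct_times if t < min_seconds)
        if fast / len(correct_times) > max_fast_correct:
            flagged.append(session_id)
    return flagged

example = {
    "A1": [(True, 1.2), (True, 1.0), (True, 0.9), (False, 5.0)],   # mostly fast corrects
    "A2": [(True, 12.0), (False, 8.0), (True, 15.5), (True, 9.8)],  # normal pacing
}
print(flag_suspect_sessions(example))  # → ['A1']
```

In practice a flag of this kind would only trigger the review steps set out in the security policy, never an automatic decision about the applicant.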

Recommendation 16: Communicate clearly and directly to test takers about their responsibility to
complete the tests as instructed. Require that they sign an agreement to do so. (Note: we
acknowledge that the manner in which this is done, and perhaps even the wisdom of doing this, must
be appropriate to the Saudi Arabian cultural context.)

Rationale for Recommendations

Recommendation 15

The proactive and vigilant approach recommended here assumes that cheating and collaborative
piracy will be ongoing threats. The recommended tactics are intended to respond quickly to
indicators of security violations and to prevent potential or imminent threats. The two purposes of this
approach are to protect the usefulness of the item banks and to promote the public’s confidence in the
civil service system.

Recommendation 16

The International Test Commission (2006) has established guidelines for unproctored online testing
that call for an explicit agreement with the test taker that he or she accepts the honesty policy implied by the
test instructions. This recommendation suggests to MCS that it benefit from this professional practice
even though the risk of cheating may be somewhat less with proctored administration. Such an
agreement allows MCS to establish clearer actions to be taken in the case of suspected cheating and
it also serves to communicate to all test takers MCS’s proactive approach to the protection of test
security. In the long run, this type of communication will help to develop a reputation for vigilance
that will encourage public confidence in the system.

Item Banking
Background

The Study 2 review of batteries provided relatively little technical information about item banks, even
for Pearson’s Watson-Glaser and SHL’s Verify, both of which rely on an IRT-based item bank
approach to the production of randomized forms for unproctored administration. No publisher that
relies on a small number of fixed forms appears to use item banking in any sense other than,
presumably, for item storage. Even among these fixed form publishers, only PSI has described a
deliberate strategy of accumulating large numbers of items over time to support the periodic
transparent replacement of forms. As a result, this review provides no recommendations relating to
the technical characteristics of bank management and forms production. Rather, we strongly
encourage an IRT bank management approach for somewhat less technical purposes. (We
understand that Study 1 provides a description of technical methods for bank management.)

Recommendations

Recommendation 17: Develop an IRT-based bank management approach to item maintenance and
test form construction. Even if MCS does not implement unproctored administration, a bank
management strategy will support

1. the development of large numbers of forms to support test security strategies,
2. a structured approach to planning for new item development,
3. job-specific adaptation of form characteristics, as appropriate over time, in the manner SHL
adapts Verify forms to job level, and
4. an ongoing item refreshment/replacement strategy that includes new items and, potentially,
deactivates compromised items for a period of time.

The bank management approach would be based on IRT estimates of item parameters that replace
classical test theory statistics over time, would enable decisions about and analyses of item retention
thresholds to periodically delete poor performing items from the bank, and would require that bank
health metrics be developed.
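
To make the IRT machinery behind such a bank concrete, the following sketch assumes the two-parameter logistic (2PL) model; the item parameters shown (a = discrimination, b = difficulty) are invented for illustration and would, in practice, be estimated from accumulated applicant data. Summing item information across a form yields the test information curve used to judge whether two assembled forms are psychometrically equivalent.

```python
# Minimal 2PL IRT sketch. Item parameters below are hypothetical;
# real values would be estimated from applicant response data.
import math

def p_correct_2pl(theta, a, b):
    """Probability that a test taker of ability theta answers correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information the item contributes at ability theta."""
    p = p_correct_2pl(theta, a, b)
    return a * a * p * (1.0 - p)

# Compare two small hypothetical forms at a target ability level (theta = 0).
form_a = [(1.2, -0.5), (0.9, 0.0), (1.1, 0.6)]  # (a, b) pairs
form_b = [(1.0, -0.4), (1.1, 0.1), (1.0, 0.5)]
for name, form in (("A", form_a), ("B", form_b)):
    info = sum(item_information(0.0, a, b) for a, b in form)
    print(f"Form {name} information at theta=0: {info:.2f}")
```

Forms whose information curves are close across the relevant ability range can be treated as interchangeable for score reporting, which is what allows many forms (or near-unique forms) to be generated from one calibrated bank.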

Recommendation 18: Seed the bank with enough items initially to support approximately 4-6 fixed
forms for both paper-pencil and online administration.

1. We do not recommend using an episodic approach of using one or two fixed forms for a
period of time, then replacing them with two new fixed forms. From the beginning, MCS
should administer different forms to different applicants to the extent possible. Note, these
initial forms will be based on CTT item and test statistics and will not provide an opportunity to
apply IRT-based rules for the production of equivalent forms.
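
Since the initial forms will rest on classical test theory (CTT) statistics, a brief sketch of the two core item statistics involved may be useful: the p-value (difficulty, the proportion answering correctly) and the point-biserial correlation (discrimination, the correlation between the 0/1 item score and the total test score). The data below are invented for illustration.

```python
# Sketch of the classical test theory item statistics referred to above.
# The item and total-score data are invented for illustration.
from statistics import mean, pstdev

def item_p_value(scores):
    """Proportion of test takers answering the item correctly (0/1 scores)."""
    return mean(scores)

def point_biserial(item_scores, total_scores):
    """Correlation between a 0/1 item score and the total test score."""
    sd_total, sd_item = pstdev(total_scores), pstdev(item_scores)
    if sd_total == 0 or sd_item == 0:
        return 0.0
    cov = mean(i * t for i, t in zip(item_scores, total_scores)) \
        - mean(item_scores) * mean(total_scores)
    return cov / (sd_total * sd_item)

item = [1, 1, 0, 1, 0, 0, 1, 1]
totals = [24, 22, 10, 20, 12, 9, 25, 18]
print(f"p = {item_p_value(item):.2f}, r_pb = {point_biserial(item, totals):.2f}")
```

Items with p-values in a moderate range and strong positive point-biserials are the ones worth retaining and, later, calibrating under IRT.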

Recommendation 19: Update and review item characteristics as adequate data allows. Such
updates/reviews may be triggered by time or volume since last update but may also be triggered by
security incidents.

1. Reviews based on security incidents might only involve the investigation of recent response
patterns, which may not be sufficient to update item statistics. Decisions to deactivate
compromised items should be conservative in favor of item security. Compromised items
would be deactivated and moved out of the active bank so they do not influence bank-wide
analyses that might be done to develop plans for new item development or to provide metrics
of overall bank health.
2. Updated item statistics should place more weight on recent data unless specific attributes of
recent data require that they be given less weight.
a. Decisions to remove items from the bank based on recent data should be
conservative in favor of retaining the item until there is a high probability that the
item harms test quality.
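
The idea in point 2, that updated item statistics should weight recent data more heavily, can be sketched as an exponentially weighted blend of the stored statistic and a recent estimate. The function, the smoothing weight, and the cap below are hypothetical choices for illustration only; an operational rule would be tuned to actual testing volumes.

```python
# Hypothetical sketch of weighting recent response data more heavily
# when updating a stored item statistic (here, the item p-value).

def update_p_value(current_p, recent_p, recent_n, weight_per_100=0.2):
    """Blend the stored p-value with an estimate from recent responses,
    giving recent data influence proportional to its volume, capped so
    a small recent sample cannot swamp the accumulated history."""
    w = min(0.8, weight_per_100 * recent_n / 100)
    return (1 - w) * current_p + w * recent_p

# An item stored at p = .55 looks much easier in recent data -- a possible
# sign of exposure that should trigger a security review, not automatic removal.
p = update_p_value(current_p=0.55, recent_p=0.80, recent_n=100)
print(round(p, 3))  # → 0.6
```

A sudden jump of this kind is exactly the sort of drift that the security-incident reviews in point 1 are meant to catch before the drifted statistic contaminates bank-wide analyses.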

Rationale for Recommendations

Recommendation 17

In the early stages of item and test development, bank management will have little, if any, practical
value since, presumably, the majority of newly developed items will be used in the initial set of 4-6
fixed forms. (We presume an objective of initial item development is to write a sufficient number of
items to support 4-6 fixed forms.) However, bank management will become a fundamental
component of MCS’s support of its civil service tests and MCS processes should be developed from
the beginning to be consistent with a bank management approach.

Recommendation 18

The proactive approach to test security is the primary consideration in recommending that 4-6 fixed
forms, rather than two as is more typical, be developed and used initially. To the extent that
applicants will communicate with one another following the initial testing session, the early reputation
of the testing system will be bolstered by the applicants’ experience that many of them completed
different forms.

Recommendation 19

The overall purpose of Recommendation 19 is that MCS should become engaged in monitoring and
tracking item characteristics as soon as adequate data is available. Such early analyses should not
only examine item characteristics and perhaps begin to develop IRT estimates but should also
examine test score distributions in different groups of applicants including applicants for different job
families, applicants with different levels of education, and applicants with different amounts/types of
work experience. In addition, early analyses should also examine subtest characteristics and
relationships and should carefully examine the characteristics of hired applicants compared to
applicants who were not hired. An important but unknown (at least to the authors) feature of the
selection system is the levels of selection rates for different jobs and locations. Early analyses of the
effects of selection rates on the benefits of the selection processes will be important.

Staff Support
Background

The Study 2 requirements specified that we provide recommendations about important staffing
considerations. We are happy to provide the recommendations below but these are based on our
own professional experience with more and less successful personnel selection programs. These
recommendations were not informed by our review of the seven batteries.

Kehoe, Mol, and Anderson (2010), Kehoe, Brown and Hoffman (2012) and Tippins (2012) summarize
requirements for effectively managed selection programs. While none of these chapters specifically
address civil service selection programs, they are likely to provide useful descriptions of many
organizational considerations important for MCS.

Our suggestions about staff requirements to manage a successful civil service selection program will
be described as a single recommendation with several specific suggestions about key staff roles and
functions. The rationale for each suggested role/function will be embedded in the description of that
role/function.

For the purposes of this set of recommended roles/functions, we are assuming the operational roles
required to manage the Test Centers are part of the whole civil service program organization.

Finally, we are making no assumptions about the current organizational structure or staffing in MCS or
in Sabbarah that may be directly involved in the early development of the subtests and other program
features. Our perspective is that it is unlikely either organization is currently staffed and organized in a
manner that is optimal for the management of a civil service selection program.

Recommendations

Recommendation 20: Staff key roles/functions with the skills and experience required to support a
successful personnel selection program. While this organization has diverse responsibilities
including research and development, policy and strategy, government leadership, and delivery
operations, its core responsibility is in the domain of personnel selection program management,
including both the technical and management requirements of this core function.

Specific roles/functions are recommended in each of the following subsections. For convenience we
will refer to the organization supporting MCS’s civil service selection program as Saudi Arabian
Selection Program (SASP).

1. Chief Operations Officer. SASP’s chief operations officer, who is responsible for the day-to-
day operation, should have significant personnel selection expertise with experience in
strategic, technical and operational leadership roles with large scale personnel selection
programs. In our view, this is the single most critical staffing decision for SASP. This role
would support both operations and technical/research support. Presumably, this role would
report to the SASP chief executive officer, whom we assume to be a senior manager/officer in
the Saudi Arabian government.

We recognize that this may be a very challenging recommendation for MCS. If it is not
feasible to identify a person with these qualifications at the beginning of MCS’s development
work, we strongly encourage MCS to develop a strong consulting relationship with a senior
industrial psychologist in the region to provide oversight and consultation. Absent this
particular professional experience and influence, decisions may be made about the many
recommendations from Study 1 and Study 2, among other things, that are more expensive
than necessary or less effective than is possible.

2. Research and Development Function. SASP should support a strong research and
“product development” function. We anticipate that SASP will be a small organization and, at
least initially, research-related work will be indistinguishable from product development work
since the only technical work underway will have a primary focus on early product
development. While this single research and product development function would ideally be
led by an individual with a strong technical background in personnel selection, the early focus
on item and test development and on systems development (administration systems, scoring
systems, applicant data management systems, scheduling and job posting systems, etc.)
could be effectively led by individuals with technical experience and expertise in psychological
test development and by individuals with experience and expertise in employment systems.

This research function is central to the progressive development of the civil service testing
system. This function would be responsible for overseeing validation studies, norming
studies, IRT estimation analyses, overall bank management, utility studies about the impact of
each selection factor on selection decisions, job analysis work, forensic data analyses in
support of the security strategy, and any analyses/investigations relating to the specification
and plans for Phase 2 work on job specific assessments.

3. Security Strategy Leader. Given the strong emphasis we have placed on a proactive and
well-communicated security strategy, we recommend that SASP establish the role of
Security Strategy Leader at the beginning of the early product development work. The MCS
security strategy will have a significant impact on early decisions about questions such as
numbers of forms, numbers of items, systems capabilities for handling multiple forms, and
systems capabilities for forensic data analyses. The security strategy should be well
established and well represented early in the product development and systems planning
work.

4. Interdisciplinary Team Approach to Work. SASP should implement a team-based
development strategy by which technical/psychometric/testing experts responsible for item
and test development have frequent opportunities to review program plans with IT / Systems /
Programming experts and with operations support managers. While the chief operating
officer should have overall management responsibility for all parts of this team approach,
these three major roles should interact at an organizational level below the chief operating
officer. We further recommend that this interdisciplinary approach to work become part of
the SASP method of work.

5. Test Center Operations. All operational roles responsible for the management of testing
processes themselves – test administration, scoring, materials storage, etc. – should require
that staff are trained and certified as competent for the performance of their roles and
responsibilities.

6. Account Manager. We recommend that SASP establish what might be called Account
Manager roles responsible for SASP’s relationships with the users of its testing services.
The three primary user groups appear to be applicants, hiring agencies, and government
officials who fund and support SASP. It is also possible that a fourth group, schools and
training organizations who prepare students for entry into the civil service workforce, should be
included as well. An Account Manager would be responsible for representing the interests of
the target group to SASP in its strategic planning, research and development and operations.
For example, an organization’s interest in reducing turnover or reducing absenteeism or
improving time to fill vacancies would be represented to SASP initially by the Account
Manager. Similarly, the Account Manager would be responsible for representing SASP’s
interest to the group he or she represents. For example, the Research leader’s interest in studying
the relationship between subtest scores and subsequent job performance would be supported
by an Account Manager’s relationship with the organizations in which such research might be
conducted.

7. User Support Desk. We recommend that SASP establish a “User Support Desk” to support
users’ operational needs, questions, and requirements. This would be an information
resource that applicants and hiring managers/HR managers, primarily, could contact for questions
about the manner in which they may use the civil service system. Presumably, this type of
information would also be available on SASP’s website to handle as many of these inquiries
as possible.

We did not address roles/functions relating to other selection factors such as interviews, experience
assessments, and academic success indicators (grade point average, etc.). Although we have assumed
the data produced by those assessment processes is available to SASP’s research and development
function, we have not assumed that SASP has any responsibility for designing those processes or for
implementing them. We have presumed that the hiring organizations will be responsible for
designing and gathering all other assessment tools and processes other than the SASP cognitive
tests.

User Support Materials
Background

In this section, we describe the types of support materials MCS should eventually develop to provide
to each of several user groups. MCS’s role as a government agency providing a service to the
public, to applicants, and to the government’s hiring organizations calls for a somewhat different set of
support materials than commercial publishers provide for prospective and actual clients. The
recommendations about user support materials are presented in Table 75 where we list the
suggested support materials and provide a brief summary of the purpose of each named document.
We organize this list by the type of user group shown here.

User groups

 HR / Selection Professionals
 Test Centers
 Hiring Organizations
 Applicants
 The Public

HR / Selection Professionals may have any of several roles. They may represent hiring
organizations, they may be researchers with an interest in MCS’s selection research, and they may be
members of the public requesting information on their own behalf or on behalf of other public groups.
In any of these cases, HR / Selection Professionals are regarded as having an interest in technical
information. The Test Centers are part of the civil service testing organization and have an interest in
having necessary instructions and other information needed to carry out their operational roles and
responsibilities. Hiring Organizations are the government organizations whose hiring decisions are
informed by MCS’s testing program. Applicants are those who complete the MCS tests in some
process of applying for jobs. The Public are citizens or organizations such as schools or training
centers in Saudi Arabia who may have some interest or stake in the civil service testing system.

Table 75. Recommended support materials for each user group.

HR / Selection Professionals
 Technical Manual: Detailed technical summary of development, reliability, validity and usage information

Testing Centers
 Administrative Instructions: Instructions to be read to test takers; procedural instructions for managing test administration processes; scoring instructions
 Training Materials: Handbook of training content
 FAQs: Answers to questions test takers frequently ask

Hiring Organizations
 Guidelines for Hiring Decisions: Guidelines for using SASP test results and other selection factors to make selection decisions
 Validity / Utility Information: Business-oriented summary information about the usefulness of SASP (and other) selection tests; includes a focus on the different criteria impacted by selection tests
 Research Participation Instructions: Instructions for participation in research activities
 Results Reports: Reports describing the test performance (and other selection factors?) of one or more applicants for an organization’s vacancies
 How to Use SASP Resources: Information about the ways a government organization could use SASP testing services (selection only?); information about specific procedures to be used as part of SASP testing processes (e.g., submitting requests to fill vacancies, obtaining applicant test results, etc.)
 Process Management Reports: Reports describing cost, time, quality of new hires, etc. resulting from one or more episodes of hiring for the manager’s organization

Applicants
 Information Brochures: General information about SASP and selection tests
 FAQs: Answers to many applicant questions
 Practice Tests and Guidance: Instructions about the use and scoring of practice tests; practice test materials
 How to Use SASP: Information about the services SASP provides to applicants, including published resources
 Feedback Report: Report describing test performance and the manner in which results will be used

The Public / Schools / Training Centers / Government Officials
 Information Brochures: General information about SASP and selection tests
 Practice Tests and Guidance: Information about ways people can prepare to take SASP tests, with an emphasis on practice tests; information about the relationship between SASP selection tests and academic/training preparation
 Summary Reports: Reports summarizing information of public interest about SASP testing, e.g., volume of test takers, number of vacancies filled, selection ratios, descriptions of factors considered, etc.

Guiding Policies
Background

Although the Study 2 requirements did not specifically request recommendations about guiding policies, we are including a set of recommended policy topics that, in our experience with selection programs, are often needed to address a wide range of common issues. These recommended policy topics are not derived from the review of the seven batteries because test publishers do not have authority over hiring organizations' management policies. Test publishers may require that professional standards be met for the use of their tests, but they do not impose policy requirements on organizations.

These recommended policy topics are offered even while we acknowledge that the relationship between the owner of the civil service testing system (SASP in our hypothetical model) and hiring organizations is unclear. While we presume the hiring organization will make the hiring decisions and will have some authority over the manner in which test information is used, it is not clear to what extent the governing policies will be "owned" by SASP, by the hiring organizations, or by both. Also, it is not clear who has authority over all the selection data entering into a civil service hiring decision, or who gathers it. For example, are the test results produced by SASP Test Centers downloaded to hiring organizations, which then have authority over the use of the test data? Does SASP have any role in gathering and recording other selection information such as experience records, interviews, resume information, and the like? Who assembles the composite scores that combine SASP test results with scores from other sources? While these unknowns affect who the policy owners would be – some SASP, some the hiring organizations – it is very likely that all of the recommended policy topics should be addressed.

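To make the composite-score question concrete: whichever party owns that step, assembling a composite amounts to combining a SASP test score with scores from other sources under agreed weights. The sketch below is purely illustrative; the component names, scales, and weights are our assumptions, not SASP specifications.

```python
# Hypothetical sketch of composite-score assembly. Component names and
# weights are illustrative assumptions, not SASP specifications.

def composite_score(components, weights):
    """Combine component scores (already on a common scale) using fixed weights.

    `components` and `weights` are dicts keyed by the same source names;
    the weights are assumed to sum to 1.0.
    """
    missing = set(weights) - set(components)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(weights[name] * components[name] for name in weights)

# Example: a cognitive test score combined with interview and experience
# ratings, each already rescaled to a 0-100 range.
scores = {"cognitive_test": 82.0, "interview": 70.0, "experience": 60.0}
weights = {"cognitive_test": 0.5, "interview": 0.3, "experience": 0.2}
print(composite_score(scores, weights))  # 74.0
```

Whoever owns this step would also own the policy questions it raises: who sets the weights, and who may see the component scores.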
We recommend policy topics without recommending the specific policies themselves. For example, the first policy topic listed in Table 76 below is the question of eligibility to take a SASP cognitive ability test. Our recommendation is that a policy should be developed to answer this question. That eventual policy may take several forms, ranging from very few constraints on who may take a test to more restrictive eligibility requirements, for example, that a person must be a basically qualified applicant for a job with a current vacancy the hiring organization is actively attempting to fill.

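A more restrictive eligibility policy of that kind could eventually be operationalized as a simple rule in the testing system. The sketch below is a hypothetical illustration only; the function name, the ordering of the checks, and the 180-day retest interval are assumptions, not recommended policy.

```python
# Hypothetical sketch of an eligibility rule under the more restrictive
# policy form described above. All names and thresholds are illustrative
# assumptions, not recommended policy.

def eligible_to_test(is_basically_qualified, vacancy_is_open,
                     days_since_last_attempt, min_retest_days=180):
    """Return (eligible, reason) for a single applicant and test."""
    if not vacancy_is_open:
        return False, "no current vacancy requires the test"
    if not is_basically_qualified:
        return False, "applicant not basically qualified for the vacancy"
    if days_since_last_attempt is not None and days_since_last_attempt < min_retest_days:
        return False, "minimum retest interval not yet elapsed"
    return True, "eligible"

# First-time, qualified applicant for an open vacancy:
print(eligible_to_test(True, True, None))  # (True, 'eligible')
# Qualified applicant retesting after only 90 days:
print(eligible_to_test(True, True, 90))    # (False, 'minimum retest interval not yet elapsed')
```

A permissive policy would simply drop checks from this rule; the value of writing it out is that each check corresponds to one of the open questions a policy must answer.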
Chapters in recent personnel selection handbooks address many of these policy issues and other specific selection program practices (Kehoe, Brown, & Hoffman, 2012; Kehoe, Mol, & Anderson, 2010; Tippins, 2012).

Recommendations

Table 76 describes all the recommended policy topics organized by the "audience" for the various policies. By "audience" we mean the user group whose actions and choices are directly affected by the policy. For example, a policy about eligibility for testing is a policy about applicants' access to the civil service testing system, even though Test Centers and Hiring Organizations would presumably have some role in managing and enforcing this policy. Admittedly, the organization of Table 76 into three non-overlapping user groups is somewhat arbitrary and imperfect, but it does serve the useful purpose of distinguishing between types of policies.

Table 76. Recommended topics for which SASP should develop guiding policies.

Audience: Applicants

Eligibility for testing / application
- What conditions must an applicant satisfy to be eligible to take a test?
  o Basically qualified for the vacancy?
  o Recency since last test result?
  o Result of previous test events?
  o Must a vacancy be open that requires the test?

Retesting
- What conditions must an applicant satisfy to be eligible to retake a test?
  o Does previous result(s) on the same test matter?
  o Does length of time since the last result on the same test matter?
  o Does the number of attempts at the same test matter?

Duration of Test Result
- Once a test result has been obtained, how long is that result usable by hiring organizations?
  o 6 months? 1 year? 5 years? Indefinitely?

Grandfathering
- If an applicant has previously worked successfully in a job for which the test is now required, is that applicant required to take the test?

Exemptions / Waivers
- What conditions may exempt an applicant from the requirement to take a test?
  o Highly desired experience?
  o Hiring manager preference?
- Who has the authority to exempt an applicant from a test requirement?

Feedback
- What information are applicants given about their own test result?
  o Raw score? Percentile score? Score range?
  o Interpretation of score (e.g., High, Average, Low)?

Test Preparation
- What advice are applicants given about test preparation options?
  o Commercial sources for test preparation courses?
  o Information about the effectiveness of test preparation, such as practice?
  o Interpretation of own practice test results?

Access to practice material
- What conditions must an individual satisfy to be given access to practice tests?
  o None?
  o Must be an applicant for an open vacancy?
  o Must not have taken the test before?

Audience: Hiring Organizations

Selection standards
- Must all locations of the same job use the test in the same way?
- Must a test be used in the same way for all applicants for the same job?
- Who has the authority to decide whether a specific test is a required selection factor?
- May the role / weight of a test result be temporarily changed based on employment market conditions?

Roles in Hiring Decisions
- Who may make a hiring decision that is based, at least in part, on a test result?
- What information other than the test result may be used? Who may decide?
- Must those who make test-based hiring decisions be trained and certified to do so?

Access to Test Results
- Who may have access to an applicant's test result(s)?
- For what purpose?

Audience: Testing Centers

Applicant management and identification
- What procedures are required to confirm applicant ID?
- How are late-arriving applicants handled?
- What steps should be taken if cheating is observed or suspected?
- What steps should be taken if an applicant leaves a test session before completing the test?

Use of Test Instructions
- May instructions be repeated / reread?
- May test administrators offer interpretations of test administration instructions?

Scoring
- How are ambiguous answer marks to be handled?
- Must manually calculated scores be independently checked?

Data Management
- If test result data is manually entered, when does that happen? By whom?

Record Keeping
- How are test materials handled following completion of a test?

Feedback
- May test administrators provide individual feedback to test takers?

Material Security
- What are the rules for test materials security?
- What are the consequences for violations of security rules?
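Several of the applicant-facing topics in Table 76, notably Retesting and Duration of Test Result, ultimately reduce to parameters a testing system must store and enforce. The sketch below is a hypothetical illustration of how such parameters might be encoded; the names and values shown are placeholders, not recommendations.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical parameters for two Table 76 topics (Retesting and Duration
# of Test Result). The values shown are placeholders, not recommendations.

@dataclass
class TestPolicy:
    min_retest_days: int = 180      # Retesting: waiting period between attempts
    max_attempts: int = 3           # Retesting: lifetime attempt limit
    result_valid_days: int = 365    # Duration of Test Result

def result_is_usable(policy, result_date, today):
    """Duration of Test Result: may hiring organizations still use this score?"""
    return today - result_date <= timedelta(days=policy.result_valid_days)

def may_retake(policy, attempts, days_since_last):
    """Retesting: has the applicant satisfied the retake conditions?"""
    return attempts < policy.max_attempts and days_since_last >= policy.min_retest_days

policy = TestPolicy()
print(result_is_usable(policy, date(2013, 1, 2), date(2013, 8, 26)))  # True
print(may_retake(policy, attempts=3, days_since_last=200))            # False
```

Whatever values SASP eventually adopts, storing them as explicit, named parameters makes the policies auditable and changeable without reworking the testing system itself.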
SECTION 8: REFERENCES
Andrew, D. M., Paterson, D. G., & Longstaff, H. P. (1933). Minnesota Clerical Test. New York: Psychological Corporation.

Baker, F. (2001). The basics of item response theory. http://edres.org/irt/baker/

Bartram, D. (2005). The Great Eight competencies: A criterion-centric approach to validation. Journal of Applied Psychology, 90, 1185-1203.

Berger, A. (1985). [Review of the Watson-Glaser Critical Thinking Appraisal]. In J. V. Mitchell, Jr. (Ed.), The ninth mental measurements yearbook (pp. 1692-1693). Lincoln, NE: Buros Institute of Mental Measurements.

Bolton, B. (1994). In J. T. Kapes, M. M. Mastie, & E. A. Whitfield (Eds.), A counselor's guide to career assessment instruments (3rd ed.). Alexandria, VA: The National Career Development Association.

Cizek, G. J., & Stake, J. E. (1995). Review of the Professional Employment Test. In J. C. Conoley & J. C. Impara (Eds.), The twelfth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurements.

Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Corel Corporation (1993). CorelDRAW! user's manual – version 4.0. Ontario, Canada: Author.

Crites, J. O. (1963). Test reviews. Journal of Counseling Psychology, 10, 407-408.

Dagenais, F. (1990). General Aptitude Test Battery factor structure for Saudi Arabian and American samples: A comparison. Psychology and Developing Societies, 2, 217-240.

Dale, E., & O'Rourke, J. (1981). The living word vocabulary. Chicago: Worldbook-Childcraft International, Inc.

Differential aptitude tests for personnel and career assessment: Technical manual. (1991). San Antonio, TX: The Psychological Corporation, Harcourt Brace & Company.

Drasgow, F., Nye, C. D., Guo, J., & Tay, L. (2009). Cheating on proctored tests: The other side of the unproctored debate. Industrial and Organizational Psychology: Perspectives on Science and Practice, 2, 46-48.

du Toit, M. (2003). IRT from SSI: BILOG-MG, MULTILOG, PARSCALE, TESTFACT. Lincolnwood, IL: Scientific Software International.

Ekstrom, R. B., French, J. W., Harman, H. H., & Dermen, D. (1976). Manual for Kit of Factor-Referenced Cognitive Tests. Princeton, NJ: Educational Testing Service.

Engdahl, B., & Muchinsky, P. M. (2001). Review of the Employee Aptitude Survey. In B. S. Plake & J. C. Impara (Eds.), The fourteenth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurements.

French, J. W. (1951). The description of aptitude and achievement tests in terms of rotated factors. Psychometric Monographs, No. 5.

French, J. W., Ekstrom, R. B., & Price, L. A. (1963). Manual for kit of reference tests for cognitive factors. Princeton, NJ: Educational Testing Service.

Geisinger, K. F. (1998). [Review of the Watson-Glaser Critical Thinking Appraisal, Form S]. In J. V. Mitchell, Jr. (Ed.), The thirteenth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurements.

Guilford, J. P. (1956). The structure of intellect. Psychological Bulletin, 53(4), 267-293.

Gulliksen, H., & Wilks, S. S. (1950). Regression tests for several samples. Psychometrika, 15, 91-114.

Hartigan, J. A., & Wigdor, A. K. (1989). Fairness in employment testing: Validity generalization, minority issues, and the General Aptitude Test Battery. Washington, DC: National Academy Press.

Hausdorf, P. A., LeBlanc, M. M., & Chawla, A. (2003). Cognitive ability testing and employment selection: Does test content relate to adverse impact? Applied H.R.M. Research, 7(2), 41-48.

Helmstadter, G. C. (1985). [Review of the Watson-Glaser Critical Thinking Appraisal]. In J. V. Mitchell, Jr. (Ed.), The ninth mental measurements yearbook (pp. 1693-1694). Lincoln, NE: Buros Institute of Mental Measurements.

Hunter, J. E. (1980). Validity generalization for 12,000 jobs: An application of synthetic validity and validity generalization to the General Aptitude Test Battery (GATB). Washington, DC: U.S. Employment Service, U.S. Department of Labor.

Hunter, J. E. (1983a). USES test research report no. 44: The dimensionality of the General Aptitude Test Battery (GATB) and the dominance of general factors over specific factors in the prediction of job performance for the U.S. Employment Service. Washington, DC: Division of Counseling and Test Development, Employment and Training Administration, U.S. Department of Labor.

Hunter, J. E. (1983b). USES test research report no. 45: Test validation for 12,000 jobs: An application of job classification and validity generalization analysis to the General Aptitude Test Battery. Washington, DC: Division of Counseling and Test Development, Employment and Training Administration, U.S. Department of Labor.

Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Thousand Oaks, CA: Sage Publications.

International Test Commission (2006). International guidelines on computer-based and Internet-delivered testing. International Journal of Testing, 6, 143-172.

Ivens, S. H. (1998). [Review of the Watson-Glaser Critical Thinking Appraisal, Form S]. In J. V. Mitchell, Jr. (Ed.), The thirteenth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurements.

Kapes, J. T., Mastie, M. M., & Whitfield, E. A. (Eds.). (1994). A counselor's guide to career assessment instruments (3rd ed.). Alexandria, VA: The National Career Development Association, A Division of The American Counseling Association.

Keesling, J. W. (1985). Review of USES General Aptitude Test Battery. In J. V. Mitchell, Jr. (Ed.), The ninth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurements.

Kehoe, J. F., Brown, S., & Hoffman, C. C. (2012). The life cycle of successful selection programs. In N. Schmitt (Ed.), The Oxford Handbook of Personnel Selection and Assessment (pp. 903-936). New York, NY: Oxford University Press.

Kehoe, J. F., Mol, S. T., & Anderson, N. R. (2010). Managing sustainable selection programs. In J. L. Farr & N. T. Tippins (Eds.), Handbook of Employee Selection (pp. 213-234). New York, NY: Routledge.

Kolz, A. R., McFarland, L. A., & Silverman, S. B. (1998). Cognitive ability and job experience as predictors of work performance. The Journal of Psychology, 132, 539-548.

Loo, R., & Thorpe, K. (1999). A psychometric investigation of scores on the Watson-Glaser Critical Thinking Appraisal new Form S. Educational and Psychological Measurement, 59, 995-1003.

Lovell, C. (1944). The effect of special construction of test items on their factor composition. Psychological Monographs, 56 (259, Serial No. 6).

MacQuarrie, T. W. (1925). MacQuarrie Test for Mechanical Ability. Los Angeles: California Test Bureau.

McKillip, R. H., Trattner, M. H., Corts, D. B., & Wing, H. (1977). The professional and administrative career examination: Research and development (PRR-77-1). Washington, DC: U.S. Civil Service Commission.

McLaughlin, G. H. (1969). SMOG grading: A new readability formula. Journal of Reading, 12, 639-646.

Mellon, S. J., Daggett, M., & MacManus, V. (1996). Development of General Aptitude Test Battery (GATB) Forms E and F. Washington, DC: Division of Skills Assessment and Analysis, Office of Policy and Research, Employment and Training Administration, U.S. Department of Labor.

O'Leary, B. (1977). Research base for the written test portion of the Professional and Administrative Career Examination (PACE): Prediction of training success for social insurance claim examiners (TS-77-5). Washington, DC: Personnel Research and Development Center, U.S. Civil Service Commission.

O'Leary, B., & Trattner, M. H. (1977). Research base for the written test portion of the Professional and Administrative Career Examination (PACE): Prediction of job performance for internal revenue officers (TS-77-6). Washington, DC: Personnel Research and Development Center, U.S. Civil Service Commission.

Otis, A. S. (1943). Otis Employment Test. Tarrytown, NY: Harcourt, Brace, and World, Inc.

Pearlman, K. (2009). Unproctored Internet testing: Practical, legal and ethical concerns. Industrial and Organizational Psychology: Perspectives on Science and Practice, 2, 14-19.

Pearlman, K., Schmidt, F. L., & Hunter, J. E. (1980). Validity generalization results for tests used to predict job proficiency and training success in clerical occupations. Journal of Applied Psychology, 65, 373-406.

Postlethwaite, B. E. (2012). Fluid ability, crystallized ability, and performance across multiple domains: A meta-analysis. Dissertation Abstracts International Section A: Humanities and Social Sciences, 72(12-A), 4648.

Ruch, W. W., Buckly, R., & McKillip, R. H. (2004). Professional Employment Test technical manual (2nd ed.). Glendale, CA: Psychological Services, Inc.

Ruch, W. W., Stang, S. W., McKillip, R. H., & Dye, D. A. (1994). Employee Aptitude Survey technical manual (2nd ed.). Glendale, CA: Psychological Services, Inc.

Sackett, P. R., & Wilk, S. L. (1994). Within-group norming and other forms of score adjustment in preemployment testing. American Psychologist, 49(11), 929-954.

Schmidt, F. L. (2012). Cognitive tests used in selection can have content validity as well as criterion validity: A broader research review and implications for practice. International Journal of Selection and Assessment, 20, 1-13.

Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124(2), 262-274. doi:10.1037/0033-2909.124.2.262

Segall, D. O., & Monzon, R. I. (1995). Draft report: Equating Forms E and F of the P&P-GATB. San Diego, CA: Navy Personnel Research and Development Center.

Siegel, L. (1958). Test reviews. Journal of Counseling Psychology, 5, 232-233.

Stelly, D. J., & Goldstein, H. W. (2007). Applications of content validation methods to broader constructs. In S. M. McPhail (Ed.), Alternative validation strategies: Developing new and leveraging existing validity evidence (pp. 252-316). San Francisco, CA: Jossey-Bass.

Sullivan, E. T., Clark, W. W., & Tiegs, E. W. (1936). California Test of Mental Maturity. Los Angeles: California Test Bureau.

Thurstone, L. L., & Thurstone, T. G. (1947). SRA Primary Mental Abilities. Chicago: Science Research Associates.

Tippins, N. T. (2012). Implementation issues in employee selection testing. In N. Schmitt (Ed.), The Oxford Handbook of Personnel Selection and Assessment (pp. 881-902). New York, NY: Oxford University Press.

Trattner, M. H., Corts, D. B., van Rijn, P., & Outerbridge, A. M. (1977). Research base for the written test portion of the Professional and Administrative Career Examination (PACE): Prediction of job performance for claims authorizers in the social insurance claims examining occupation (TS-77-3). Washington, DC: Personnel Research and Development Center, U.S. Civil Service Commission.

Viswesvaran, C. (2001). Assessment of individual job performance: A review of the past century and a look ahead. In N. Anderson, D. Ones, H. K. Sinangil, & C. Viswesvaran (Eds.), Handbook of Industrial, Work & Organizational Psychology (pp. 110-126). Thousand Oaks, CA: Sage Publications.

Wang, L. (1993). The Differential Aptitude Test: A review and critique. Paper presented at the annual meeting of the Southwest Educational Research Association, Austin, TX, January 28-30.

Watson, G., & Glaser, E. M. (1942). Watson-Glaser Critical Thinking Appraisal. Tarrytown, NY: Harcourt, Brace, and World, Inc.

Wigdor, A. K., & Sackett, P. R. (1993). Employment testing and public policy: The case of the General Aptitude Test Battery. In H. Schuler, J. L. Farr, & M. Smith (Eds.), Personnel Selection and Assessment: Individual and Organizational Perspectives. Hillsdale, NJ: Lawrence Erlbaum Associates.

Willson, V. L., & Wing, H. (1995). Review of the Differential Aptitude Test for Personnel and Career Assessment. In J. C. Conoley & J. C. Impara (Eds.), The twelfth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurements.

Wilson, R. C., Guilford, J. P., Christianson, P. R., & Lewis, P. J. (1954). A factor-analytic study of creative-thinking abilities. Psychometrika, 19, 297-311.

Woehr, D. J., & Roch, S. (2012). Supervisory performance ratings. In N. Schmitt (Ed.), The Oxford Handbook of Personnel Selection and Assessment (pp. 517-531). New York, NY: Oxford University Press.