
Educational and Psychological Measurement and Evaluation

TEST VALIDITY
CONTENTS
Messick's (1989) unified view of validity
 Content Validity
 Construct Validity
 Criterion Validity
 Consequential Validity
Factors affecting validity
 Instructional Procedures
 Test Administration & Scoring Procedures
 Test Itself
 Student Characteristics
 Functioning Content & Teaching Procedure
Relationship between Validity and reliability
INTRODUCTION
 Classroom assessment consists of determining the purpose
and the learning targets related to standards, systematically
obtaining information from students, interpreting the
information collected, and using the information.
 There are several criteria that determine the quality of
the classroom assessment.
 Validity is one of the most important criteria for ensuring
high-quality classroom assessment.
What is validity?
 Validity is an integrated evaluative judgment of the degree to
which empirical evidence and theoretical rationales support
the adequacy and appropriateness of interpretations and
actions based on test scores or other modes of assessment
(Messick, 1993).

 Validity is the most important type of quality control
procedure for measuring assessment (Thorndike, 1997).

 The validity of a test is the extent to which it measures what it
is designed to measure.

(Messick, 1993)
 Validity is used to determine whether the test measures what
it claims to measure.

 Validity is concerned with whether the information being
gathered is relevant to the decision that needs to be made.

 If a test is not valid or does not measure what it claims to
measure, then the test has no meaningful use.

(Steven, 2005)
Types of validity

 Content validity
 Construct validity
 Criterion validity
◦ Concurrent validity
◦ Predictive validity
 Consequential validity
Content Validity
 The extent to which the assessment is representative of the domain of
interest.
 Content validity is usually based on whether the assessment
appropriately measures or samples the domain of instructional
material presented to the students.
 Example :
• Suppose you wanted to test everything that sixth-grade
students learn in a four-week unit about insects. Such a test would
take students too long to complete.
• The solution is to select a sample of what has been taught,
assess it, and then use students' achievement on this sample to make
inferences about their knowledge of the entire domain of content,
reasoning and other objectives.
• For example, if a student correctly answers 85% of the items, you
infer that the student knows about 85% of the content in the entire unit.
• If your sample is judged to be representative of the domain, then
you have content-related evidence for validity.

(Steven, 2005)
How to determine an adequate sampling?

 Use a test blueprint or table of specifications to further delineate which
objectives you intend to assess and what is important in the content
domain.
 A table of specifications is a two-way grid that shows the content and types of
learning targets represented in your assessment.

Table of specifications (cognitive levels from Bloom's Taxonomy)

Major Content Areas       Remember  Understand  Application  Analysis  Evaluate  Create  Total
Topic 1
Topic 2
.......
Total no. of items / %    No/(%)    No/(%)      No/(%)       No/(%)    No/(%)    No/(%)  Total no. of items / 100%

(McMillan, 2007)
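The table of specifications above can be filled in mechanically once the relative emphasis of each cell is decided. The following is a minimal sketch, not a prescribed procedure: the function names, the weights, and the 20-item test length are all hypothetical, and the rounding rule (largest remainder) is just one reasonable choice.

```python
def allocate_items(weights, total_items):
    """Split total_items across blueprint cells in proportion to their weights.

    weights maps (topic, cognitive_level) -> relative emphasis (hypothetical values).
    Largest-remainder rounding keeps the cell counts summing to total_items.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    exact = {cell: total_items * w / total_weight for cell, w in weights.items()}
    counts = {cell: int(share) for cell, share in exact.items()}
    leftover = total_items - sum(counts.values())
    # Hand the remaining items to the cells with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - counts[c], reverse=True)
    for cell in by_remainder[:leftover]:
        counts[cell] += 1
    return counts


def level_percentages(counts):
    """Percentage of all items that falls under each cognitive level (column totals)."""
    total = sum(counts.values())
    per_level = {}
    for (_topic, level), n in counts.items():
        per_level[level] = per_level.get(level, 0) + n
    return {level: 100 * n / total for level, n in per_level.items()}


# Hypothetical emphasis for a two-topic unit and a 20-item test.
weights = {
    ("Topic 1", "Remember"): 3,
    ("Topic 1", "Understand"): 2,
    ("Topic 2", "Remember"): 2,
    ("Topic 2", "Application"): 3,
}
counts = allocate_items(weights, total_items=20)
print(counts)
print(level_percentages(counts))
```

The row and column totals produced this way correspond to the "Total" row and column of the grid, which is what a reviewer would check when judging whether the sample of items is representative.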
Construct Validity
 The extent to which the assessment is a meaningful measure of
an unobservable trait or characteristic such as honesty, attitude
and intelligence.
 Example :
 Let’s say you want to create an assessment that measures the
construct of job stress. The scores on the job stress assessment
should relate to the various predictions made by theories about
job stress.
 For example, theories of job stress indicate a relationship
between job stress and variables such as absenteeism, job turnover,
or peer ratings of job stress.
 If the job stress test is able to predict these variables, then the
test is considered to have construct validity.

(McMillan, 2007)
Criterion Validity
 The relationship between an assessment and another
measure of the same trait.
 With this type of validity, an assessment is given to a
group of students. Their assessment scores are then
compared to their scores on an external criterion
appropriately related to what the test claims to measure.
 Criterion-related validity may be measured by comparing
the students' scores on the new achievement test with their
scores on an older, already established achievement test.
 If the new test scores are closely related to the scores on the
established test, then the new test is considered valid.

(McMillan, 2007)
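One common way to quantify how "closely related" two sets of scores are is the Pearson correlation coefficient, often reported as a validity coefficient. The sketch below assumes that approach; the slides do not prescribe a statistic, and the scores and the name `pearson_r` are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores of six students on the new and the established test.
new_test = [55, 62, 70, 48, 81, 90]
established_test = [58, 60, 75, 50, 78, 92]
print(round(pearson_r(new_test, established_test), 2))
```

A coefficient near 1 would count as criterion-related evidence for the new test; how high is "high enough" depends on the decision the scores are used for.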
Types of criterion validity :
1) Concurrent Validity

 When the criterion is something that is happening or
being assessed at the same time as the construct of
interest, it is called concurrent validity.

 Example :
A researcher administers a self-esteem inventory to a
group of 8th graders and compares their scores on it with
their teachers' ratings of the students' self-esteem obtained
at about the same time.
2) Predictive Validity

Predictive evidence of validity is obtained by measuring your
participants at one point in time on your test and then, at a
future time, measuring them on the criterion measure.
This takes more time and effort than gathering concurrent evidence, but it
can provide superior evidence that your test does what you
want it to do.
For example, if scores on a trial mathematics examination have a strong
relationship with later SPM mathematics grades, then the trial examination
is said to have high predictive validity: achievement on the trial
examination can be used to predict SPM results.
Consequential Validity
 Messick (1989) originally introduced consequences to the
validity argument. Later, Shepard (1997) broadened the
definition by arguing one must investigate both
positive/negative and intended/unintended consequences of
score-based inferences to properly evaluate the validity of the
assessment system.
 The consequential aspect of validation assesses the values of
score interpretations as they relate to bias, fairness, and social
consequences.
 Consequential validity describes the intended and unintended
effects of assessment on instruction or teaching and student
learning.
 Discussions of the consequences of assessments usually
distinguish between intended positive consequences and
unintended negative consequences.

(Sax, 1997)
 Although both types of consequences are relevant to the
evaluation of the validity of an assessment program, what is
a positive effect for one observer may be a negative effect
for someone else.
 Example : Narrowing the curriculum
 Narrowing may be viewed as a positive outcome by
those who want instruction to have a sharper focus on
the material in the state content standards, but it may be
viewed as a negative outcome by those who worry that
important concepts and content areas (e.g., history or
music) may be shortchanged.
 Because there is no universal agreement about whether
particular outcomes are positive or negative, the discussion
below is not separated into intended positive and
unintended negative consequences.
Intended Consequences
Lane and Stone (2002) suggest that assessments are intended
to impact:
◦ Student, teacher, and administrator motivation and effort;
◦ Curriculum and instructional content and strategies;
◦ Content and format of classroom assessments;
◦ Improved learning for all students;
◦ Professional development support;
◦ Use and nature of test preparation activities; and
◦ Student, teacher, administrator, and public awareness and
beliefs about the assessment, criteria for judging
performance, and the use of assessment results.
Unintended Consequences
Lane and Stone (2002) note, however, that unintended
consequences are also possible, such as:
◦ Narrowing of curriculum and instruction to focus only on the
specific learning outcomes assessed;
◦ Use of test preparation materials that are closely linked to the
assessment without making changes to the curriculum and
instruction;
◦ Use of unethical test preparation materials; and
◦ Inappropriate use of test scores by administrators.
FACTORS AFFECTING VALIDITY
 Instructional Procedures
 Test Administration & Scoring Procedures
 Test Itself
 Student Characteristics
 Functioning Content & Teaching Procedure
Instructional Procedures
With educational tests, in addition to the
content of the test, the
way the material is presented can influence
validity.

For example : consider a test of critical
thinking skills. If the students were coached
and given solutions to the particular
problems included on the test, validity would
be compromised. This is a potential problem
when teachers “teach the test”.

(Reynolds et al., 2009)
Test Administration & Scoring Procedures
Deviation from standard administrative and
scoring procedures can undermine validity.

In terms of administration, failure to
provide the appropriate instructions or follow strict
time limits can lower validity.

Whether it is a teacher-made test or a standardized
test, adverse physical and psychological conditions
during testing time may affect validity.

(Reynolds et al., 2009)
The test administration and scoring procedures may
also affect the validity of the interpretations from the
results. For instance, in teacher-made tests, factors like
insufficient time to complete the test, unfair help to
individual students, cheating during the examination,
and unreliable scoring of essay answers might
lower validity.

Similarly, in standardized tests, failure to follow
standard directions and time limits, unauthorized help
to students, and errors in scoring would tend to lower
validity.

(Reynolds et al., 2009)
Test Itself
Length of the test
• A test usually represents a sample of many possible questions. If
the test is too short to be a representative sample, then
validity will be affected accordingly. Homogeneous
lengthening of a test increases both validity and
reliability.

Unclear directions
• If the directions on how to respond to the items,
whether it is permissible to guess, and how to record the
answers are not clear to the pupils, then validity will
tend to be reduced.

(Reynolds et al., 2009)
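The point about test length can be illustrated with the Spearman-Brown prophecy formula from classical test theory, which estimates the reliability of a test lengthened with comparable (homogeneous) items. This is a sketch under that assumption: the formula speaks to reliability only, and the function name and the figures used are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing test length by length_factor.

    reliability: reliability of the current test (between 0 and 1).
    length_factor: new length divided by old length (2 means twice as long).
    Assumes the added items are parallel to the existing ones.
    """
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# A hypothetical 20-item test with reliability 0.60, doubled to 40 items.
print(spearman_brown(0.60, 2))
```

The gain shrinks as the test grows, so doubling an already long test buys less than doubling a short one.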
Reading vocabulary and sentence structures which are too difficult
• Vocabulary and sentence structure that are too complicated
for the pupils taking the test may cause it to fail to
measure the intended aspects of pupil performance, thus
lowering validity.

Inappropriate level of difficulty of the test items
• When the test items have an inappropriate level of
difficulty, it will affect the validity of the tool. For
example, in criterion-referenced tests, failure to match
the difficulty specified by the learning outcome will
lower validity.

(Reynolds et al., 2009)
Poorly constructed test items
• Test items which provide unintentional clues to the
answer will tend to measure the pupils’ alertness in
detecting clues as well as the intended aspects of pupil
performance, which ultimately affects validity.

Ambiguity
• Ambiguous statements in test items lead to
misinterpretation, differing interpretations and confusion.
Sometimes ambiguity confuses the better students more than
the poorer ones, so that the items discriminate
in a negative direction. As a consequence, the validity of
the test is lowered.

(Reynolds et al., 2009)
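Negative discrimination, as described under Ambiguity, can be checked with an upper-lower discrimination index: the proportion of high scorers who answer the item correctly minus the proportion of low scorers who do. The sketch below assumes the conventional top and bottom 27% groups; the function name and all response data are hypothetical.

```python
def discrimination_index(item_correct, total_scores, fraction=0.27):
    """Upper-lower discrimination index D = p_upper - p_lower for one item.

    item_correct: 1 or 0 per student for the item being examined.
    total_scores: each student's total test score, in the same order.
    fraction: share of students placed in each extreme group (assumed 27%).
    """
    n = len(total_scores)
    if n != len(item_correct) or n < 2:
        raise ValueError("need matching lists with at least two students")
    group_size = min(max(1, round(n * fraction)), n // 2)
    order = sorted(range(n), key=lambda i: total_scores[i])  # lowest scorers first
    lower, upper = order[:group_size], order[-group_size:]
    p_upper = sum(item_correct[i] for i in upper) / group_size
    p_lower = sum(item_correct[i] for i in lower) / group_size
    return p_upper - p_lower


# Hypothetical data for ten students, listed from highest to lowest total score.
total_scores = [95, 90, 88, 80, 75, 70, 65, 60, 50, 40]
clear_item = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]      # better students tend to succeed
ambiguous_item = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]  # better students are misled
print(discrimination_index(clear_item, total_scores))
print(discrimination_index(ambiguous_item, total_scores))
```

A negative value for an item is a signal to reread its wording for ambiguity before the item is reused.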
Student Characteristics
There are certain personal factors which influence the pupils'
response to the test situation and invalidate the test
interpretation. Students who are emotionally disturbed, lack
motivation, or are afraid of the test situation
may not respond normally, and this may ultimately affect
validity.

Response set also influences the test results. It is the
test-taking habit which affects the pupil's score. If a test is
used again and again, its validity may be reduced.

For example : if an examinee experiences high
levels of test anxiety or is not motivated to put forth
a reasonable effort, the results may be distorted.

(Reynolds et al., 2009)
Functioning Content & Teaching Procedure
In achievement testing, the functioning content of test
items cannot be determined only by examining the
form and content of the test. The teacher has to teach
fully how to solve a particular problem before
including it in the test.

Tests of complex learning outcomes can be considered valid if
the test items function as intended. If the students have
previous experience of the solution to a problem
included in the test, then the test is no longer a valid
instrument for measuring the more complex mental
processes, and validity is affected.

(Reynolds et al., 2009)
RELATIONSHIP BETWEEN RELIABILITY AND VALIDITY
 Validity and reliability are closely related.
 A test cannot be considered valid unless the
measurements resulting from it are reliable.
 However, results from a test can be reliable without
necessarily being valid.

(Messick, 1989)
A test cannot be valid unless it is reliable.
• If a test does not measure something consistently, it follows that
it cannot always be measuring it accurately.
It is quite possible for a test to be reliable but invalid.
• A test can consistently give the same results, although it is not
measuring what it is supposed to.

(Alderson, 1995)
Reliability And Validity
If                        Then
Unreliable                Test validity is undermined
Reliable but not valid    Test is not useful
Unreliable and invalid    Test is definitely not useful
Reliable and valid        Test can be used with good results
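The first row of the table, that unreliability undermines validity, has a numerical counterpart in classical test theory: the correlation between a test and a criterion cannot exceed the square root of the product of their reliabilities. The sketch below assumes that model; the function name and the reliability values are hypothetical.

```python
import math


def max_validity_coefficient(test_reliability, criterion_reliability=1.0):
    """Upper bound on the test-criterion correlation under classical test theory.

    Both arguments are reliability coefficients between 0 and 1. The criterion
    is treated as perfectly reliable unless stated otherwise.
    """
    for value in (test_reliability, criterion_reliability):
        if not 0 <= value <= 1:
            raise ValueError("reliabilities must lie between 0 and 1")
    return math.sqrt(test_reliability * criterion_reliability)


# A hypothetical test with reliability 0.49 cannot correlate above about 0.70
# with any criterion, however well its content is chosen.
print(max_validity_coefficient(0.49))
```

The bound only works in one direction: a high reliability permits a high validity coefficient but does not guarantee one, which matches the "reliable but not valid" row.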


ACTIVITY
STATEMENT TRUE / FALSE

Validity is used to determine whether the test measures what it claims to measure.

Criterion validity is the extent to which the assessment is representative of the domain of interest.

Content validity is the extent to which the assessment is a meaningful measure of an unobservable trait or characteristic such as honesty, attitude and intelligence.

A test that is being assessed at the same time as the construct of interest is said to have concurrent validity.

Complicated vocabulary and sentence structure that cause the test to fail to measure the intended aspects of pupil performance, thus lowering validity, are a factor of teaching procedures.

If a test is unreliable and invalid, then it is definitely not useful.

A test cannot be considered valid unless the measurements resulting from it are reliable.

If a test is unreliable, then it is not useful.

The test administration and scoring procedures may also affect the validity of the interpretations from the results.
Q&A
Question :
Can a test be valid if it is not reliable?

Answer :
No, a test cannot be valid if it is not reliable.
If a test does not measure something consistently, it follows that it
cannot always be measuring it accurately.
Question :
Is it possible for an examination to have high reliability
but low validity? How can we avoid this kind of situation?

Answer :
Yes, it is possible. To avoid this kind of situation, you have to
look back at the factors that affect validity.
Question :
What's the difference between validity and reliability?

Answer :
Reliability refers to how consistent the data are. If a method is
reliable, the measure will
produce the same results on different occasions.
Validity refers to the ability to measure what was set out to
be measured. A test is valid when it measures what it
intends to measure.
References
Airasian, P. W., & Russell, M. K. (2008). Classroom Assessment: Concepts and Applications (6th ed.). New York, NY: McGraw-Hill.

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language Test Construction and Evaluation. Cambridge: Cambridge University Press.

McMillan, J. H. (2007). Classroom Assessment (4th ed.). Boston, MA: Pearson.

Messick, S. (1989). Educational Measurement (3rd ed.). New York, NY: American Council on Education and Macmillan.

Messick, S. (1993). Foundations of Validity: Meaning and Consequences in Psychological Assessment. Princeton, NJ: ETS.

Reynolds, C. R., Livingston, R. B., & Willson, V. (2009). Measurement and Assessment in Education (2nd ed.). Upper Saddle River, NJ: Pearson.

Sax, G. (1997). Principles of Educational and Psychological Measurement and Evaluation. Belmont, CA: Wadsworth.

Shepard, L. A. (1997). The Centrality of Test Use and Consequences for Test Validity.

Steven, R. B. (2005). Classroom Assessment: Issues and Practices. Boston, MA: Pearson.