
SGDY 5063

Educational & Psychological Measurement & Evaluation

Presented by: Mohd Hafiz Bin Mohd Salleh (810350) Syahrul Azrin Binti Ghafar (810398) Aza Nurain Binti Azmey (811333) Nur Zahirah Binti Othman (811562)

Supervised by: Tuan Haji Shuib Bin Hussain

PRESENTATION OVERVIEW

- What is reliability?
- Practical methods of estimating reliability
- The reliability of criterion-referenced and mastery tests
- Conditions affecting reliability coefficients
- The standard error of measurement
- Sources of error
- Error components of different reliability coefficients
- The reliability of individual and group scores
- The reliability of difference scores
- Ways to improve norm-referenced and criterion-referenced reliability

Introduction to Test Reliability



Are the scores consistent? Are they stable?

WHAT IS RELIABILITY?

- Reliability refers to the reliability of a test score or set of test scores, not the reliability of the test itself.
- Reliability refers to the consistency of a measure. A test is considered reliable if we get the same result repeatedly (C. Kendra, 2012).
- Reliability is not the same as validity: validity asks "Does a test measure what it is supposed to measure?"

Technically, reliability is defined as true-score variance divided by obtained-score variance.
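In classical test theory this ratio is usually written as below; this is a standard formulation (not specific to this presentation), assuming error scores are uncorrelated with true scores:

```latex
% Reliability: the proportion of obtained-score variance
% that is true-score variance.
r_{xx} = \frac{\sigma_T^2}{\sigma_X^2}
       = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
```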

Practical Methods of Estimating Reliability

- Because teachers do not know an individual's true score, they cannot know the amount of error for any given person.
- We can, however, estimate the effect of chance on measurements in general.

Practical Methods of Estimating Reliability


- Consistency of measurement means high reliability.
- The degree of consistency can be determined by a correlation coefficient.
- Correlation is the generic term for all measures of relationship; reliability may be thought of as a special type of correlation that measures the consistency of observed scores.

Practical Methods of Estimating Reliability


- The perfect test would be unaffected by the sources of unreliability, and on this perfect test each examinee would receive his or her true score. Unfortunately, the observed score we get has likely been affected by one or more of the sources of unreliability.
- So our observed score is likely too high or too low. The difference between the observed score and the true score is called the error score, and it can be positive or negative.
- We can express this mathematically as:

True Score = Obtained Score +/- Error

T = O +/- E (or, looking at it another way, O = T +/- E)

Practical Methods of Estimating Reliability

- Test-retest (stability)
- Alternate forms (equivalence)
- Internal consistency

Practical Methods of Estimating Reliability


- The test-retest reliability method is one of the simplest ways of testing the stability and reliability of an instrument over time (Martyn Shuttleworth, 2009).
- For example, if a group of students takes a test, we would expect them to show very similar results if they take the same test a few months later. This definition relies on there being no confounding factors during the intervening time interval.
- Instruments such as IQ tests and surveys are prime candidates for test-retest methodology, because there is little chance of people experiencing a sudden jump in IQ or suddenly changing their opinions.
- Educational tests are often not suitable, because students will learn much more over the intervening period and show better results on the second test.
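A minimal sketch of the computation in Python (the paired scores are hypothetical, invented for illustration):

```python
import numpy as np

# Scores of the same five students on two administrations
# of the same test, a few months apart (hypothetical data).
first = np.array([78, 85, 62, 90, 70])
second = np.array([80, 83, 65, 88, 74])

# The test-retest (stability) coefficient is the Pearson
# correlation between the two sets of scores.
r_tt = np.corrcoef(first, second)[0, 1]
print(f"Test-retest reliability: {r_tt:.2f}")
```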


Practical Methods of Estimating Reliability


- Alternate-form reliability is established by administering two different forms of the same test to the same individuals (psychwiki.com, 2012).
- This method is convenient for avoiding the problems that come with the test-retest method. With the alternate-form method, an individual is tested on one form of the test and then again on a comparable second form, with an interval of about one week in between.
- This method is used more than the test-retest method because it has fewer associated problems, including a substantial reduction in practice effects.


Practical Methods of Estimating Reliability


- A procedure for studying reliability when the focus of the investigation is on the consistency of scores on the same occasion and on similar content, but when conducting repeated testing or alternate-forms testing is not possible.

Kuder-Richardson: a series of formulas based on dichotomously scored items

Coefficient alpha: Cronbach's alpha (most widely used, as it can be applied to continuous item types)

Split-half (odd-even): Spearman-Brown correction applied to the full test (easiest to do and understand)
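As an illustration of the coefficient-alpha idea, a minimal numpy sketch (the 0/1 item matrix is invented; alpha is computed from item and total-score variances in the usual way):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum(item variances) / total variance)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical data: one row per student, one column per item.
scores = np.array([
    [1, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 1, 1],
])
print(f"alpha = {cronbach_alpha(scores):.2f}")
```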


Reliability of Classroom Tests


- We would recommend the split-half reliability method.
- Step 1: Split the test into two parts (odd vs. even items).
- Step 2: Use the Pearson product-moment correlation (ungrouped data) to determine rxy, the correlation between the two halves of the scale. By splitting the test in half we reduce the number of items, which we know will automatically reduce the reliability.
- Step 3: To estimate the reliability of the whole test, apply the Spearman-Brown correction formula:

rsb = 2rxy / (1 + rxy)

where rsb is the split-half reliability coefficient.
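A minimal sketch of the three steps in Python, on made-up item data (scipy's pearsonr stands in for the hand calculation):

```python
import numpy as np
from scipy.stats import pearsonr

# One row per student, one column per item (1 = correct, 0 = wrong).
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 1],
])

# Step 1: split into odd- and even-numbered items.
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# Step 2: correlate the two half-test scores to get rxy.
r_xy, _ = pearsonr(odd_half, even_half)

# Step 3: Spearman-Brown correction to full test length.
r_sb = 2 * r_xy / (1 + r_xy)
print(f"rxy = {r_xy:.2f}, rsb = {r_sb:.2f}")
```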


As a Teacher, What Do We Need to Know Most About Reliability?

- For tests we create:
  - Increasing the number of items increases reliability.
  - A moderate difficulty level increases reliability.
  - Having items measure similar content increases reliability.
- For standardized tests we use:
  - Look up each test's published reliability data.
  - Use the published reliability coefficient to determine the standard error of measurement (SEM) reported in the data.
  - See the following illustration.


Estimating Reliability

- If we could re-administer a test to one person an infinite number of times, what would we expect the distribution of their scores to look like?
- Answer: the bell-shaped curve.


THE RELIABILITY OF CRITERION-REFERENCED TESTS


- Because the purpose of a criterion-referenced test is quite different from that of a norm-referenced test, it should not be surprising that the approaches used for reliability differ too. With criterion-referenced tests, scores are often used to sort candidates into performance categories.
- Variation in candidates' scores is not so important if candidates are still assigned to the same performance category. Therefore, it has been common to define reliability for a criterion-referenced test as the extent to which performance classifications are consistent over parallel-form administrations.
- For example, it might be found that 80% of candidates are classified in the same way by parallel forms of a criterion-referenced test administered with little or no instruction in between administrations. This is similar to parallel-form reliability for a norm-referenced test, except that with criterion-referenced tests the focus is on the decisions rather than the scores.


THE LIVINGSTON APPROACH


- Livingston's formula for estimating the reliability of criterion-referenced measures:

k^2 = (s^2 rxx + (M - C)^2) / (s^2 + (M - C)^2)

where rxx is the conventional norm-referenced reliability, M is the mean score, C is the criterion (cutoff) score, and s^2 is the variance of the obtained scores.

- When the mean score equals the criterion score, the Livingston coefficient equals the reliability obtained by conventional norm-referenced methods.
- When the mean score differs from the criterion score, the Livingston coefficient is greater than the reliability obtained by conventional norm-referenced methods.

WEAKNESSES
- Harris: the SEM remains the same, so a larger coefficient does not imply a more dependable determination.
- Chester
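A small sketch of the Livingston coefficient in Python, assuming the form of the formula given above (the function name and the illustrative numbers are my own):

```python
def livingston_k2(r_xx: float, mean: float, cutoff: float, var: float) -> float:
    """Livingston coefficient for a criterion-referenced test.

    r_xx:   conventional (norm-referenced) reliability
    mean:   mean of the obtained scores
    cutoff: criterion (cutoff) score
    var:    variance of the obtained scores
    """
    d2 = (mean - cutoff) ** 2
    return (var * r_xx + d2) / (var + d2)

# When the mean equals the cutoff, k^2 reduces to rxx;
# when they differ, k^2 exceeds rxx, as the slide notes.
print(livingston_k2(0.80, mean=70, cutoff=70, var=25))  # ~0.80
print(livingston_k2(0.80, mean=75, cutoff=70, var=25))  # ~0.90
```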


THE PERCENTAGE-AGREEMENT APPROACH

- Criterion-referenced and mastery tests are often administered to classify applicants or students into one of two groups (those who have mastered the test content according to some criterion and those who have not).
- The index is simple to compute:

Percentage agreement = (number of persons classified the same way on both administrations) / (total number of persons tested)

WEAKNESSES
- The value is affected by the number of persons being tested.
- It is also affected by the cutoff score and its closeness to the test's mean score.
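A minimal sketch in Python (the master/non-master classifications are invented for illustration):

```python
# Classifications of the same ten students on two parallel forms
# of a criterion-referenced test: "M" = master, "N" = non-master.
form_a = ["M", "M", "N", "M", "N", "M", "N", "N", "M", "M"]
form_b = ["M", "M", "N", "N", "N", "M", "N", "M", "M", "M"]

# Percentage agreement: the proportion of examinees placed in
# the same category by both administrations.
agree = sum(a == b for a, b in zip(form_a, form_b))
print(f"Percentage agreement: {agree / len(form_a):.0%}")  # 80%
```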


Conditions Affecting Reliability Coefficients

- Test scoring
- Test content
- Test administration
- Personal conditions


Test Scoring

- Scorer reliability is the extent to which different observers or raters agree with one another as they mark the same set of papers.
- Error can arise from differences between two scorers' judgments, or from a single scorer over time (fatigue) and/or the halo effect.
- The higher the extent of agreement, the higher the scorer reliability.


Test Content

- The sample of test items is too small.
- The sample of test items is not evenly selected across the material.


Test Administration

- Instructions accompanying the test may contain errors that create another type of systematic error: errors in the instructions provided to the test-taker. Instructions that interfere with accurately gathering information (such as a time limit when the attribute the test is measuring has nothing to do with speed) reduce the reliability of a test.
- Noise and surroundings
- Physical conditions


Personal Conditions

- Factors related to the test-taker, such as poor sleep, feeling ill, or being anxious or "stressed out," are integrated into the test itself.
- Temporary ups and downs


THE STANDARD ERROR OF MEASUREMENT (SEM)


The formula for the standard error of measurement is

SEmeas = SD * sqrt(1 - reliability)

where SD equals the standard deviation of the obtained scores.

So, if you have a test with a standard deviation (SD) of 4.89 and an estimated reliability of .91, the standard error of measurement is calculated as follows:

SEmeas = 4.89 * sqrt(1 - .91) = 4.89 * sqrt(.09) = 4.89 * .30 = 1.467 ≈ 1.47

The standard error of measurement is expressed in the same unit as the standard deviation.
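The same computation in Python, using the slide's numbers (the 68% band interpretation assumes normally distributed errors):

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

s = sem(4.89, 0.91)
print(round(s, 2))  # 1.47

# An obtained score of, say, 50 falls within +/- 1 SEM of the
# true score roughly 68% of the time.
print(f"50 +/- {s:.2f} -> ({50 - s:.2f}, {50 + s:.2f})")
```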


THE RELIABILITY OF INDIVIDUAL & GROUP SCORES


- Decisions involving a single individual require a much higher degree of reliability than is necessary for evaluating groups.
- Truman Kelley (1927) suggested:
  - reliabilities of at least .94 for decisions about individual students
  - reliabilities of not less than .50 for groups of students
- Ideally, reliability would be +1.00, but in practice test users have to settle for less than that.


THE RELIABILITY OF DIFFERENCE SCORES


- Sometimes testers want to compare differences between pretest and posttest scores.
- The reliability of difference scores depends on the reliability of each test and on the correlation between the two tests.
- The following formula from Gulliksen (1950) can be used to estimate the reliability of difference scores.


THE RELIABILITY OF DIFFERENCE SCORES


Reliability of difference scores = (average reliability - correlation between tests) / (1 - correlation between tests)

- e.g.: The reliability of each of two tests is .90 and they correlate .80:

Reliability of difference scores = (.90 - .80) / (1 - .80) = .10 / .20 = .50
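The same estimate in Python (the function name is my own; the formula is Gulliksen's, as given above):

```python
def diff_score_reliability(r1: float, r2: float, r12: float) -> float:
    """Gulliksen's estimate: (average reliability - r12) / (1 - r12).

    r1, r2: reliabilities of the two tests
    r12:    correlation between the two tests
    """
    r_bar = (r1 + r2) / 2
    return (r_bar - r12) / (1 - r12)

# The slide's example: both reliabilities .90, correlation .80.
print(diff_score_reliability(0.90, 0.90, 0.80))  # ~0.50
```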


THE RELIABILITY OF DIFFERENCE SCORES


- When the reliability of the tests equals the correlation between the tests, the reliability of difference scores will be 0.
- When the reliability of each test is +1.0, the reliability of difference scores will also be +1.0.
- The reason difference scores typically have low reliability is that each of the two tests being compared contains error of measurement.
- To increase the reliability of difference scores, testers can:
  a) select or construct highly reliable tests
  b) increase the number of items on the tests


THE RELIABILITY OF DIFFERENCE SCORES


- When scores do not measure improvement but are merely comparisons of two different test scores, the correlation between the two tests is usually lower than the correlation between obtained and gain scores.


THE RELIABILITY OF DIFFERENCE SCORES


- Difference or gain scores suffer from other deficiencies:
  - Gain scores do not necessarily measure improvement in knowledge.
  - Students who initially perform at a high level cannot be expected to make gains equal to those made by students in the middle or lower parts of the score distribution.
  - The regression effect can account for differences among students, especially when they are initially selected on the basis of exceptionally high or low scores.



SOURCES OF ERROR

- Reliability coefficients provide estimates of error that vary in magnitude depending on the conditions allowed to influence them.
- There are three major sources of error:
  - characteristics of students
  - characteristics of the test
  - conditions affecting test administration and scoring


CHARACTERISTICS OF STUDENTS

- The true level of a student's knowledge or ability is an example of a desired source of true variance.
- The outcome of a student's true level of knowledge is sometimes debatable.
- Error-producing conditions are usually temporary and undesirable.
- They are not under the direct control of teachers.
- Related factors include:
  - test-wiseness
  - illness and fatigue (sickness, not enough rest)
  - lack of motivation (nervousness, anxiety, fear, family problems)
- Any temporary and unpredictable change in a student can be considered intra-individual error, or error within test-takers.


CHARACTERISTICS OF THE TEST

- Tricky questions
- Ambiguous questions
- Confusing format
- Excessively difficult questions
- Too few items
- Dissimilar item content
- Reading level that is too high


CONDITIONS AFFECTING TEST ADMINISTRATION & SCORING


- Physical environment (temperature, humidity, lighting, seating conditions/arrangements, distractions)
- Instructions given to examinees (their clarity, complexity, and consistency; avoidance of ambiguity; suitability across age differences, idiosyncratic mannerisms, and racial and ethnic backgrounds)
- Scoring: scorers may make arithmetic errors, fail to understand and abide by the scoring criteria, make mistakes in recording scores, show bias, etc.


ERROR COMPONENTS OF DIFFERENT RELIABILITY COEFFICIENTS

- Stability: a different stability coefficient will be obtained over different time intervals; error increases with changes in students.
- Equivalence: error will be known whenever one form of a test is substituted for another.
- Stability & equivalence: error will be increased due to changes in students, items, or both.
- Internal consistency (homogeneity): stresses the importance of making sure that guessing, item ambiguity, difficulty levels, directions, and scoring criteria and methods are controlled as far as possible.

WAYS TO IMPROVE NORM-REFERENCED RELIABILITY

1. Increase the number of good-quality items.
2. Construct easy items to reduce guessing.
3. Construct items to measure the same trait or ability.
4. Avoid using tricky and ambiguous questions.
5. Prepare the test to permit objective scoring.
6. Make sure pictorial material is readily identifiable, and do not crowd items on the page.
7. Remember that means tend to be more reliable than individual scores.
8. High scores tend to be less reliable than average scores.
9. Avoid the use of gain or difference scores.
10. Make test instructions to students clear and consistent.


WAYS TO IMPROVE CRITERION-REFERENCED RELIABILITY

1. Where reasonable, develop the test so that there is a substantial difference between the test mean and the cutoff score.
2. Include as many items as possible.
3. Make sure that objectives are as specific as possible.
4. Make sure that scoring is as specific as possible.
5. Provide students with practice items that are similar in format to the test they will be taking.
6. Design the test format to be clear and uncluttered.
7. Assuming score variability, use the methods that improve reliability for norm-referenced tests.


REFERENCES
- Gilbert S., James W. N. (1997). Principles of Educational and Psychological Measurement and Evaluation. United States: Wadsworth Publishing.
- Lammers, W. J., & Badia, P. (2005). Fundamentals of Behavioral Research. California: Thomson and Wadsworth.
- Reliability, http://www.psychwiki.com, retrieved on 3 April 2012.
- Martyn Shuttleworth (2009), http://www.experimentresources.com/test-retest-reliability.html, retrieved on 3 April 2012.



- Confounding variables are variables that the researcher failed to control or eliminate, damaging the internal validity of an experiment.
