Presented by: Mohd Hafiz Bin Mohd Salleh (810350) Syahrul Azrin Binti Ghafar (810398) Aza Nurain Binti Azmey (811333) Nur Zahirah Binti Othman (811562)
PRESENTATION OVERVIEW
O What is reliability?
O Practical methods of estimating reliability
O The reliability of criterion-referenced and mastery tests
O Conditions affecting reliability coefficients
O The standard error of measurement
O Sources of error
O Error components of different reliability coefficients
O The reliability of individual and group scores
O The reliability of difference scores
O Ways to improve norm-referenced and criterion-referenced reliability
WHAT IS RELIABILITY?
O Reliability refers to the consistency of a test score or set of test scores, not to a property of the test itself.
O Reliability refers to the consistency of a measure. A test is considered reliable if we get the same result repeatedly. (C. Kendra, 2012)
O Reliability is not the same as validity: validity asks, does a test measure what it is supposed to measure?
O Because test users see only the observed score, they cannot know the amount of error for any given person.
O We can, however, estimate the effect of chance on measurements in general.
O Consistency of measurement means high reliability.
O The degree of consistency can be determined by a correlation coefficient.
O Imagine a perfect test, free from all sources of unreliability; on this perfect test each examinee would get his or her true score. Unfortunately, the observed score we get is likely affected by one or more of the sources of unreliability.
O So, our observed score is likely too high or too low. The difference between the observed score and the true score is called the error score, and it can be positive or negative.
O We can express this mathematically as:
O True Score = Obtained Score +/- Error
O T = O +/- E (or, looking at it another way, O = T +/- E)
Practical Methods of Estimating Reliability
O Test-Retest
O Alternate Forms
O Internal Consistency
O Test-retest reliability is one of the ways of testing the stability and reliability of an instrument over time. (Martyn Shuttleworth, 2009)
O For example, if a group of students takes a test, we would expect them to show very similar results if they take the same test a few months later. This definition relies upon there being no confounding factor during the intervening time interval.
O Instruments such as IQ tests and surveys are prime candidates for test-retest methodology, because there is little chance of people experiencing a sudden jump in IQ or suddenly changing their opinions.
O Educational tests are often not suitable, because students will learn much more information over the intervening period and show better results in the second test.
O Alternate-form reliability is estimated by administering two different forms of the same test to the same individuals. (psychwiki.com, 2012)
O This method is convenient for avoiding the problems that come with the test-retest method. With the alternate-form method, an individual is tested on one form of the test and then again on a comparable second form; the in-between time is about one week.
O This method is used more than the test-retest method because it has fewer associated problems, including a reduction in practice effects.
O Internal consistency is used when the focus of investigation is on the consistency of scores on the same occasion and on similar content, but when conducting repeated testing or alternate-forms testing is not possible.
O Kuder-Richardson: a series of formulas based on dichotomously scored items
O Coefficient alpha: Cronbach's alpha (most widely used, as it can be used with continuous item types)
O Split-half (odd-even): Spearman-Brown correction applied to the full test (easiest to do and understand)
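As a concrete illustration of the Kuder-Richardson approach, here is a minimal sketch of the KR-20 formula in Python; the 0/1 response matrix and group size are invented for the example, and population (divide-by-n) variances are used throughout.

```python
def kr20(item_responses):
    """Kuder-Richardson Formula 20 for dichotomously scored (0/1) items.

    item_responses: one list of 0/1 item scores per examinee.
    KR-20 = (k / (k - 1)) * (1 - sum(p*q) / variance of total scores)
    """
    n_people = len(item_responses)
    k = len(item_responses[0])                      # number of items
    totals = [sum(person) for person in item_responses]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people
    # Sum of item variances: p * (1 - p) for each dichotomous item
    sum_pq = 0.0
    for i in range(k):
        p = sum(person[i] for person in item_responses) / n_people
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)

# Hypothetical 0/1 item scores: 5 examinees x 4 items
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(round(kr20(data), 3))
```

With real data the same function applies unchanged; only the response matrix changes.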
O Step 1: Split the test into two halves (odd-numbered vs. even-numbered items).
O Step 2: Use the Pearson correlation formula for ungrouped data to determine rxy (rxy represents the correlation between the two halves of the scale). By doing the split-half we reduce the number of items, which we know will automatically reduce the reliability.
O Step 3: To estimate the reliability of the whole test, apply the Spearman-Brown correction formula.
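The split-half procedure can be sketched in a few lines of Python; the odd-half and even-half subtotals below are made-up numbers for illustration only.

```python
def pearson_r(x, y):
    """Pearson correlation for ungrouped data (Step 2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_brown(r_half):
    """Step 3: step the half-test correlation up to a full-length estimate."""
    return 2 * r_half / (1 + r_half)

# Hypothetical odd-item and even-item subtotals for 6 examinees (Step 1 done)
odd_half  = [10, 8, 12, 6, 9, 11]
even_half = [11, 7, 12, 5, 10, 10]

r_xy = pearson_r(odd_half, even_half)
full_test = spearman_brown(r_xy)
print(round(r_xy, 3), round(full_test, 3))
```

Note that the corrected coefficient is always at least as large as the half-test correlation, reflecting the longer test.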
O Applying the Spearman-Brown correction to the full test increases the estimated reliability.
O For standardized tests we can:
O Look for each test's published reliability data.
O Determine the standard error of measurement (abbreviated SEM) found in the data.
O See the following illustration.
Estimating Reliability
O If we could re-administer a test to one person an infinite number of times, what would we expect the distribution of their scores to look like?
O Answer: the bell-shaped curve.
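This can be illustrated with a small simulation of the T = O +/- E model; the true score and error spread below are hypothetical values chosen for the sketch.

```python
import random
from collections import Counter

random.seed(0)

TRUE_SCORE = 75      # hypothetical true score of one examinee
ERROR_SD = 4         # hypothetical standard deviation of the error component

# Hypothetically "re-administer" the test 100,000 times: O = T + E
scores = [TRUE_SCORE + random.gauss(0, ERROR_SD) for _ in range(100_000)]

# Bin the observed scores; the counts trace out a bell-shaped curve
hist = Counter(round(s) for s in scores)
for value in range(65, 86):
    print(f"{value:3d} {'#' * (hist[value] // 500)}")
```

Because the errors are random and average out, the distribution centers on the true score, which is exactly why the mean of many administrations estimates T.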
O Since the purpose of a criterion-referenced test is different from that of a norm-referenced test, it should not be surprising that the approaches used for reliability are different too. With criterion-referenced tests, scores are often used to sort candidates into performance categories.
O Variation in candidate scores is not so important if candidates are still assigned to the same performance category. Therefore, it has been common to define reliability for a criterion-referenced test as the extent to which performance classifications are consistent over parallel-form administrations.
O For example, it might be determined that 80% of the candidates are classified in the same way by parallel forms of a criterion-referenced test administered with little or no instruction in between test administrations. This is similar to parallel-form reliability for a norm-referenced test, except that the focus with criterion-referenced tests is on the decisions rather than the scores.
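The classification-consistency idea can be sketched directly; the candidate scores and cutoff below are hypothetical, chosen so that the agreement works out to 80% as in the example above.

```python
def classification_consistency(form_a, form_b, cutoff):
    """Proportion of candidates placed in the same master/non-master
    category by two parallel forms of a criterion-referenced test."""
    same = sum(
        (a >= cutoff) == (b >= cutoff)
        for a, b in zip(form_a, form_b)
    )
    return same / len(form_a)

# Hypothetical scores of 10 candidates on two parallel forms, cutoff = 70
form_a = [72, 65, 80, 55, 90, 68, 74, 60, 85, 71]
form_b = [75, 62, 78, 58, 88, 72, 70, 66, 83, 69]
print(classification_consistency(form_a, form_b, 70))
```

Note that the index ignores how much the raw scores move, as long as candidates stay on the same side of the cutoff.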
Livingston Coefficient
O K = (rxx · s² + (M − C)²) / (s² + (M − C)²), where rxx is the conventional norm-referenced reliability, s² is the score variance, M is the mean score, and C is the criterion (cutoff) score.
O If the mean score equals the criterion score, the Livingston coefficient equals the reliability computed by conventional norm-referenced methods.
O If the mean score differs from the criterion score, the Livingston coefficient is greater than the reliability computed by conventional norm-referenced methods.
WEAKNESSES
O Harris: since the SEM is the same, a higher coefficient does not imply a more dependable determination.
O Chester
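A minimal sketch of the Livingston coefficient, using hypothetical reliability, variance, mean, and cutoff values; it reproduces the two mean-versus-criterion cases stated above.

```python
def livingston(r_xx, variance, mean, criterion):
    """Livingston's criterion-referenced reliability coefficient:
    K = (r_xx * s^2 + (M - C)^2) / (s^2 + (M - C)^2)
    where M is the mean score and C the criterion (cutoff) score."""
    d2 = (mean - criterion) ** 2
    return (r_xx * variance + d2) / (variance + d2)

# Hypothetical values: conventional reliability .80, score variance 25
# Case 1: mean equals the criterion score -> K equals the conventional reliability
print(livingston(0.80, 25.0, 70.0, 70.0))
# Case 2: mean differs from the criterion score -> K exceeds it
print(round(livingston(0.80, 25.0, 70.0, 65.0), 3))
```

The second call shows why critics object: K rises simply because the cutoff sits far from the mean, even though the SEM of the test is unchanged.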
O Used when a test is administered to classify applicants/students into one of two groups (whether or not they have mastered the test content, according to some criterion)
O Simple to compute
O WEAKNESSES
O Will be affected by the number of persons being tested
O Affected by the cutoff score and its closeness to the test's mean score
CONDITIONS AFFECTING RELIABILITY COEFFICIENTS
O Test Scoring
O Test Content
O Test Administration
O Personal Conditions
Test Scoring
O Scorer reliability is the extent to which different observers or raters agree with one another as they mark the same set of papers.
O Error can arise from differences between two scorers' judgments, or from one scorer over time (fatigue) and/or a halo effect.
O The higher the extent of agreement, the higher the scorer reliability.
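One simple index of scorer reliability is the proportion of papers on which two raters assign the same mark; the marks below are hypothetical.

```python
def percent_agreement(scorer1, scorer2):
    """Proportion of papers on which two raters give identical marks.
    (A simple agreement index; it does not correct for chance agreement.)"""
    matches = sum(a == b for a, b in zip(scorer1, scorer2))
    return matches / len(scorer1)

# Hypothetical marks given to the same 8 papers by two scorers
scorer1 = ["A", "B", "B", "C", "A", "D", "B", "C"]
scorer2 = ["A", "B", "C", "C", "A", "D", "B", "B"]
print(percent_agreement(scorer1, scorer2))   # agree on 6 of 8 papers
```

For high-stakes scoring, chance-corrected indices (such as Cohen's kappa) are usually preferred over raw percent agreement.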
Test Content
O The sample of test items is too small
O The sample of test items is not evenly distributed across the content being measured
Test Administration
O Instructions that come with the test may contain errors that create another type of systematic error. These errors exist in the instructions provided to the test-taker. Instructions that interfere with accurately gathering information (such as a time limit when the attribute the test is measuring has nothing to do with speed) reduce the reliability of a test.
O Noise and surroundings
O Physical conditions
Personal Conditions
O Factors related to the test-taker, such as poor sleep, feeling ill, anxious, or "stressed out", are integrated into the test itself.
O Temporary ups and downs
O Given a standard deviation of 4.89 and an estimated reliability of .91, the standard error of measurement would be calculated as follows:
O SEmeas = SD × √(1 − reliability) = 4.89 × √0.09 = 4.89 × 0.3 = 1.467 ≈ 1.47
O The standard error of measurement is expressed in the same units as the test scores.
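The same calculation in Python, using the SD and reliability figures from the worked example:

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SEmeas = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Worked example: SD = 4.89, estimated reliability = .91
print(round(sem(4.89, 0.91), 2))   # 1.47
```

An obtained score of, say, 80 would then carry a band of roughly 80 ± 1.47 for one SEM on either side.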
O As a rule of thumb, test users look for:
- Reliabilities of at least .94 for decisions about individual students
- Reliabilities of not less than .50 for decisions about groups of students
O Ideally, reliability should be +1.00, but in practice test users must accept lower values.
O e.g.: If the reliability of two tests is .90 and they correlate .80, the reliability of the difference scores is (.90 − .80) / (1 − .80) = .50
O If the correlation between the two tests equals the average of their reliabilities, the reliability of difference scores will be 0; if both tests are perfectly reliable (+1.0), the reliability of difference scores will also equal +1.0.
O To increase the reliability of difference scores, testers can: a) select or construct highly reliable tests; b) increase the number of items on the tests.
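A minimal sketch of the standard difference-score reliability formula, assuming the two tests have equal score variances; the numbers reuse the .90/.80 example.

```python
def diff_score_reliability(r11, r22, r12):
    """Reliability of difference scores (equal-variance case):
    r_dd = (average of the two reliabilities - r12) / (1 - r12)."""
    return ((r11 + r22) / 2 - r12) / (1 - r12)

# Example: both tests have reliability .90 and correlate .80
print(round(diff_score_reliability(0.90, 0.90, 0.80), 2))
```

Notice how quickly the result drops: two quite reliable tests (.90) yield difference scores of only .50 reliability once the tests correlate .80.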
O Because difference scores involve comparisons of two different test scores, their reliability is usually lower than the reliability of the scores obtained from either test alone.
O Students who initially perform at a high level cannot be expected to make gains equal to those made by students in the middle or lower parts of the distribution.
O The regression effect can account for differences among students, especially when they are initially selected on the basis of exceptionally high or low scores.
SOURCES OF ERROR
CHARACTERISTICS OF STUDENTS
O The true level of a student's knowledge or ability is an example of a desired source of true variance.
O The outcome of a student's true level of knowledge is sometimes debatable.
O They are usually temporary situations and undesirable conditions.
O Not under the direct control of teachers.
O The factors related to this:
- Test wiseness
- Illness and fatigue (sickness, not enough rest)
- Lack of motivation (nervousness, anxiety, fear, family problems)
THE TEST ITSELF
O Tricky questions
O Ambiguous questions
O Confusing format
O Excessively difficult questions
O Too few items
O Dissimilar item content
O A reading level that is too high
Administration
O Conditions of administration: avoidance of distractions, instructions (their clarity and completeness), timing, etc.
Scoring
O Scorers may make arithmetic errors, fail to understand and abide by the scoring criteria, make mistakes in recording scores, show bias, etc.
ERROR COMPONENTS OF DIFFERENT RELIABILITY COEFFICIENTS
O Different reliability coefficients reflect different sources of error; for example, test-retest coefficients reflect stability over time.
O Guessing, item ambiguity, difficulty levels, directions, and scoring criteria and methods are controlled as much as possible.
WAYS TO IMPROVE NORM-REFERENCED RELIABILITY
1. Increase the number of good-quality items.
2. Construct easy items to reduce guessing.
3. Construct items to measure the same trait or ability.
4. Avoid using tricky and ambiguous questions.
5. Prepare the test to permit objective scoring.
6. Make sure the pictorial material is readily identifiable, and do not crowd items on the page.
7. Remember that means tend to be more reliable than individual scores.
8. High scores tend to be more unreliable than average scores.
9. Avoid the use of gain or difference scores.
10. Make test instructions to students clear and consistent.
WAYS TO IMPROVE CRITERION-REFERENCED RELIABILITY
1. If reasonable, develop a test in which there is a substantial difference between the test mean and the cutoff score.
2. Include as many items as possible.
3. Make sure that objectives are as specific as possible.
4. Make sure that scoring is as specific as possible.
5. Provide students with practice items that are similar in format to the test they will be taking.
6. Design the test format to be clear and uncluttered.
7. Assuming score variability, use the methods that improve reliability for norm-referenced tests.
REFERENCES
O Gilbert S., James W. N. (1997). Principles of Educational and Psychological Measurement and Evaluation. United States: Wadsworth Publishing.
O Lammers, W. J., & Badia, P. (2005). Fundamentals of Behavioral Research. California: Thomson and Wadsworth.
O Reliability, http://www.psychwiki.com, retrieved on 3 April 2012.
O Martyn Shuttleworth (2009), http://www.experimentresources.com/test-retest-reliability.html, retrieved on 3 April 2012.
O A confounding variable is an extraneous variable that the researcher failed to control or eliminate, damaging the internal validity of an experiment.