
Identifying Qualities Desired in Short-Answer Items

1. Does this item measure the specified skill?
2. Is the level of reading skill required by this item below the students' ability?
3. Can this item be answered using one or two words or a short sentence?
4. Will only a single or very homogeneous set of responses provide a correct response to the item?
5. Does the item use grammatical structure and vocabulary different from that contained in the source of instruction?
6. Does the format of the item (and the test in general) allow for efficient scoring?
7. If the item requires a numerical response, does the question state the unit of measurement to be used in the answer?
8. Does the blank represent a keyword?
9. Are blanks provided at or near the end of the item?
10. Is the number of blanks sufficiently limited?
11. Is the physical length of each blank the same?
ESSAY ITEMS
Criteria for Evaluating Essay Items

1. Does the item measure the specified skill?
2. Is the level of reading skill required by this item below the students' ability?
3. Will all or almost all of the students answer this item in less than ten minutes?
4. Will the scoring plan result in different readers assigning similar scores to a given student's response?

To facilitate reliable scoring, the plan should incorporate three characteristics (a hypothetical sketch of such a plan appears after the list below). First, the total number of points the item is worth must be specified. The points associated with each essay item should be proportional to the relative importance of the skill being tested; the importance of a test item should not be equated with the amount of time students need to answer it. Second, the attributes to be evaluated must be specified. Third, the plan must state how points will be awarded. A scoring plan must be precise enough for the reader to know when to award a point and when not to.
5. Does the scoring plan describe a correct and complete response?
6. Is the item written in such a way that the scoring plan will be obvious to knowledgeable students?
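To make the three characteristics of a scoring plan concrete, the sketch below shows one possible way such a plan might be written out before any papers are read. The attribute names, point values, and function are hypothetical illustrations, not part of any prescribed procedure.

```python
# A minimal sketch of an essay scoring plan, assuming hypothetical attributes
# and point values. The total (10 points) and the list of attributes are fixed
# in advance; the award function makes explicit when each point is given.

SCORING_PLAN = {
    "states a clear thesis": 2,
    "supports the thesis with at least two pieces of evidence": 4,
    "addresses an opposing argument": 2,
    "organizes ideas logically": 2,
}  # 10 points in total, roughly proportional to the importance of each attribute

def score_response(observed: dict) -> int:
    """Award the points for each attribute the reader judged to be present."""
    return sum(points for attribute, points in SCORING_PLAN.items()
               if observed.get(attribute, False))

# Example: a response showing the thesis and supporting evidence, but nothing else.
print(score_response({
    "states a clear thesis": True,
    "supports the thesis with at least two pieces of evidence": True,
}))  # -> 6
```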

SCORING STUDENTS' RESPONSES TO ESSAY ITEMS

One common approach to holistic scoring involves placing a student's paper into one of three groups, representing low, medium, and high categories of judged quality. Often, the reader re-evaluates students' answers within each group to subdivide responses into additional categories. The advantages of holistic scoring are that it (1) is relatively fast and (2) can be used with items for which answers cannot be subdivided into components. Weaknesses of holistic scoring are that it (1) usually results in less reliable scores and (2) provides limited information to students as to why an answer was or was not judged appropriate.

In general, then, essay items have some advantages over other written-test formats. They tend to measure targeted behaviors more directly, and they facilitate examination of students' ability to communicate ideas in writing. Essay questions also more readily assess intellectual skills. Essay items have basic limitations as well: they usually provide a less adequate sampling of the content to be assessed, they are less reliably scored, and they are more time-consuming to score.

MATCHING TYPE

Strengths
- Allows the comparison of related ideas, concepts, or theories
- Efficient means of assessing the association between a variety of items within a given topic or category
- Encourages the integration of information
- Preferable item type when multiple-choice assessment repeatedly utilizes the same response options
- Relatively easy and quick to score
- Objective nature limits bias in scoring
- Easily administered to large numbers of students
- Limits bias caused by poor writing skills

Weaknesses
- Difficult to generate a sufficient number of plausible premises
- Not effective in testing isolated facts or bits of information
- May limit assessment to lower levels of understanding
- Only useful when there is a sufficient number of related items
- May overestimate learning due to the influence of guessing

ALTERNATE-CHOICE ITEMS: TRUE-FALSE TYPE AND OTHER VARIATIONS

An alternate-choice item presents a proposition for which one of two opposing options represents the correct response. The options of an alternate-choice item always come in pairs.

VARIATIONS OF ALTERNATE-CHOICE ITEMS
1. Traditional True-False Items. A single statement is made, usually consisting of one sentence. Students are asked to indicate whether the statement is true or false.
2. True-False Items Requiring Corrections. Students are asked to rewrite false items as correct statements.
3. Embedded Alternate-Choice Items. Each item consists of an underlined word or group of words. Students are asked to indicate whether each underlined element represents a particular quality, such as being a correctly spelled word or a factually correct statement.
4. Multiple True-False Items. A conventional multiple-choice item consists of a stem and a list of options; usually one of the options is correct and the others serve as distracters. In a multiple true-false item, each option is instead judged true or false in relation to the stem. When a group of items shares a common stem in this manner, they are called multiple true-false items.
5. Sequential True-False Items. The correct response to each item depends on a condition specified in the previous item. Sequential true-false items can be used in settings where the solution of a problem requires a series of steps, each step providing information to the next stage.
6. Focused Alternate-Choice Items.

RELIABILITY

By definition, reliability is the degree to which a test measures something consistently. A test can be reliable without being valid, but a test cannot be valid unless it is reliable. The fact that reliability is a prerequisite to validity is one reason it is important to understand both reliability and validity. One cannot see reliability directly; instead, we depend on various indicators of it.

COMMON THREATS TO RELIABILITY

1. Inconsistency Between Earlier and Later Measures. If students are administered tests before and after instruction, their performance normally changes; this inconsistency is desirable. In contrast, the inability of an aptitude test to provide results that are stable over time is usually considered undesirable. A source of inconsistency between earlier and later measures must therefore be evaluated to determine whether it is undesirable. If it is undesirable, it is a threat to reliability; otherwise, it is not relevant to the reliability of the measure.
2. Inconsistencies Between Test Items That Supposedly Measure the Same Skill. Were student performance consistent from item to item, more than one test item would be redundant. Instead, we use several items to average out the inconsistency and increase our confidence in the measure. One reason for inconsistency is that students often guess at the correct answer, and their guesses are inconsistently lucky. Another reason is that many test items pose vague questions.
3. Inconsistencies Between Alternate Skills in the Same Content Domain. For instance, when assessing students' ability to deliver a persuasive speech, it is impractical and inappropriate to divide the delivery of a speech into each of its hundreds of component skills.
4. Inconsistencies from Measuring Unrelated Qualities Within One Test. Using one score to report performance on unrelated qualities lowers the reliability of test scores. As with a combined fuel and temperature gauge, combining unrelated factors into a single test score reduces the internal consistency and the usefulness of the score.
5. Inconsistencies Between Different Raters in the Scoring of Student Responses. Although often not practical, it is beneficial to have more than one rater read students' papers, review portfolios, observe students during performance assessments, and even informally observe students in class.
6. Inconsistencies in Decisions Based on Student Performance. Sometimes a passing score is established for a test. With standardized tests, passing scores are sometimes established to make decisions such as whether a student is eligible to receive a high school diploma or to be admitted to a particular college. Misclassifications occur most often among students whose true performance is close to the passing score.

CONVENTIONAL METHODS FOR ESTIMATING RELIABILITY OF TEST SCORES

1. Test-Retest Method. Possibly the most obvious way to judge whether a test measures something consistently is to re-administer the same test to the same students.
2. Alternate-Form Method. Another strategy for estimating reliability involves administering two forms, or versions, of a test to the same students. This results in each student obtaining two scores, one for each test form. The correlation coefficient between the students' two scores is then calculated.
3. Methods of Internal-Consistency Reliability. The methods described in this section are based on the administration of a single test and the use of component subtests (information internal to the test) to estimate test-score reliability. (A computational sketch of these methods appears at the end of this section.)
   3.1 Split-Half Method. The test is divided into two halves, and the correlation coefficient is then calculated between students' scores on the respective halves of the test.
   3.2 Kuder-Richardson Method. These formulas are applicable to tests scored dichotomously, where 1 point is given for a correct answer and zero for a wrong answer.
   3.3 Cronbach Coefficient Alpha. Coefficient alpha provides a reliability estimate for a measure composed of items scored with values other than 0 and 1.
4. Inter-Rater Method. Two raters who separately review a student's work may come up with very different judgments of that student's performance. The inter-rater method can be used to detect this inconsistency. Basically, two or more teachers independently score each student's performance, obtaining two scores for each student. The correlation coefficient is then computed between the teachers' scores.

TECHNIQUES FOR IMPROVING RELIABILITY

1. Improving the Quality of Observations. When a test presents a vague task, students' reactions and responses are inconsistent. Ambiguities can be reduced by a variety of means, depending on the test format.
2. Improving the Scoring of Performances. The scoring process can pose a serious threat to reliability, particularly for some formats, such as essay items, where two raters may assign quite different scores to the same paper.
3. Increasing the Number of Observations. One method is to include more items in a test. Another is to have more than one person score each assessment. Still another is to combine the observations of individual students to obtain a measure of the performance of a group of students.
4. Expanding the Breadth of Observations.

DIFFERENCE BETWEEN RELIABILITY AND VALIDITY

An assessment cannot have a high degree of validity unless its reliability is also high. Recall that validity is concerned with whether the test measures what it is supposed to measure. Validity is critical because a test is of no use unless the capability being evaluated is in fact being measured. Reliability therefore represents a very important attribute, because it is a prerequisite to validity.
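As a concrete illustration of the internal-consistency methods listed earlier, the sketch below computes a split-half estimate (with the Spearman-Brown correction), KR-20, and Cronbach's alpha from a small set of made-up dichotomous item scores. It is a minimal study aid under those assumptions, not a replacement for a statistics package.

```python
# Minimal sketch of internal-consistency reliability estimates, using made-up
# data: rows are students, columns are items scored 0/1. The Cronbach alpha
# computation would also work with items scored on other scales.

scores = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 0, 0],
]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def correlation(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (variance(xs) ** 0.5 * variance(ys) ** 0.5)

# Split-half method: correlate odd-item and even-item half scores, then apply
# the Spearman-Brown correction to estimate reliability at full test length.
odd_halves = [sum(row[0::2]) for row in scores]
even_halves = [sum(row[1::2]) for row in scores]
r_half = correlation(odd_halves, even_halves)
split_half = 2 * r_half / (1 + r_half)

# Kuder-Richardson formula 20 (dichotomous items only):
# KR-20 = k/(k-1) * (1 - sum(p*q) / variance of total scores)
k = len(scores[0])
totals = [sum(row) for row in scores]
p = [sum(row[i] for row in scores) / len(scores) for i in range(k)]
kr20 = k / (k - 1) * (1 - sum(pi * (1 - pi) for pi in p) / variance(totals))

# Cronbach's coefficient alpha: same structure, but item variances may come
# from any scoring scale, not just 0/1 (with 0/1 data it equals KR-20).
item_vars = [variance([row[i] for row in scores]) for i in range(k)]
alpha = k / (k - 1) * (1 - sum(item_vars) / variance(totals))

print(round(split_half, 3), round(kr20, 3), round(alpha, 3))
```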

ITEM ANALYSIS

Item analysis is the name given to a variety of statistical techniques designed to analyze individual items on a test after the test has been given to a group of examinees. The analysis involves examining class-wide performance on individual test items. Item analysis sometimes suggests why an item has not functioned effectively and how it might be improved. A test composed of items revised and selected on the basis of item-analysis data is almost certain to be much more reliable than one composed of an equal number of untested items. There are three common types of analysis, which provide teachers with three different types of information:

Difficulty index. This is the first piece of information to be generated in an item analysis. It represents the proportion of students who answered the test item correctly.

Discrimination index. The discrimination index is a basic measure of the validity of an item. It measures an item's ability to discriminate between those who scored high on the total test and those who scored low. Though there are several steps in its calculation, once computed, this index can be interpreted as an indication of the extent to which overall knowledge of the content area or mastery of the skills is related to the response on the item. Perhaps the most crucial validity standard for a test item is that whether a student gets the item correct is due to their level of knowledge or ability and not to something else such as chance or test bias.

Analysis of Response Options (Distracter Analysis). In addition to examining an entire test item, teachers are often interested in examining the performance of individual distracters (incorrect answer options) on multiple-choice items. By calculating the proportion of students who chose each answer option, teachers can identify which distracters are working, appearing attractive to students who do not know the correct answer, and which distracters are simply taking up space and not being chosen by many students. To reduce the chance that blind guessing results in a correct answer purely by chance (which hurts the validity of a test item), teachers want as many plausible distracters as is feasible. Analyses of response options allow teachers to fine-tune and improve items they may wish to use again with future classes.
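The three indices above can be computed with a short script. The sketch below uses made-up responses to a single multiple-choice item whose key is assumed to be option B; the 27% upper/lower grouping used for the discrimination index is a common rule of thumb, not the only possible choice.

```python
# Minimal sketch of an item analysis for one multiple-choice item, using
# made-up data. Each tuple is (total test score, option chosen); the correct
# answer (key) is assumed to be "B".

responses = [
    (38, "B"), (35, "B"), (34, "C"), (31, "B"), (30, "B"),
    (28, "A"), (26, "B"), (24, "D"), (22, "C"), (20, "B"),
    (18, "A"), (15, "C"), (14, "A"), (12, "D"), (10, "C"),
]
KEY = "B"

# Difficulty index: proportion of all students answering the item correctly.
difficulty = sum(1 for _, choice in responses if choice == KEY) / len(responses)

# Discrimination index: proportion correct in the upper group minus the
# proportion correct in the lower group (here, the top and bottom 27% of
# students ranked by total test score).
ranked = sorted(responses, key=lambda r: r[0], reverse=True)
n_group = max(1, round(0.27 * len(ranked)))
upper, lower = ranked[:n_group], ranked[-n_group:]
p_upper = sum(1 for _, c in upper if c == KEY) / n_group
p_lower = sum(1 for _, c in lower if c == KEY) / n_group
discrimination = p_upper - p_lower

# Distracter analysis: proportion of students choosing each answer option.
options = sorted({choice for _, choice in responses})
option_proportions = {opt: sum(1 for _, c in responses if c == opt) / len(responses)
                      for opt in options}

print(f"difficulty = {difficulty:.2f}")
print(f"discrimination = {discrimination:.2f}")
print(f"option proportions = {option_proportions}")
```

With these made-up responses the item is fairly hard (difficulty 0.40) but discriminates well (0.75), and the option proportions show which distracters are actually attracting students.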
